The purpose of this chapter is to provide some background on the nonlinear programming (NLP) problem. Indeed, an exhaustive discussion of both theoretical and practical aspects of nonlinear programming can be the subject matter of an entire book.
There are several reasons for studying nonlinear programming in an optimal control class:
We start by recalling some useful mathematical definitions that will be used in the following.
Definition 2.1: Convex Set
A set $S \subseteq \mathbb{R}^n$ is said to be convex if, for any $x_1, x_2 \in S$ and any $\lambda \in [0, 1]$,
$$\lambda x_1 + (1 - \lambda) x_2 \in S.$$
Geometrically, convexity of a set means that every line segment joining two elements of the set belongs to the set.
Definition 2.2: Convex Function
A function $f: S \to \mathbb{R}$ is convex if $S$ is convex and, for any $x_1, x_2 \in S$ and any $\lambda \in [0, 1]$, it holds that:
$$f(\lambda x_1 + (1 - \lambda) x_2) \le \lambda f(x_1) + (1 - \lambda) f(x_2).$$
Remark 2.1
A function $f$ is said to be strictly convex if the convexity condition holds with strict inequality for $x_1 \neq x_2$ and $\lambda \in (0, 1)$, that is:
$$f(\lambda x_1 + (1 - \lambda) x_2) < \lambda f(x_1) + (1 - \lambda) f(x_2).$$
Geometrically, it means that the line segment joining $(x_1, f(x_1))$ and $(x_2, f(x_2))$ lies above the graph of $f$. Figures 2.1 and 2.2 illustrate the previous definitions.
Definition 2.3: Cone, Convex Cone
A nonempty set $C \subseteq \mathbb{R}^n$ is said to be a cone if, for every point $x \in C$ and every $\lambda \ge 0$, $\lambda x \in C$.
If, in addition, $C$ is convex, then it is said to be a convex cone.
Definition 2.4: Positive definite matrix
A symmetric square matrix $A \in \mathbb{R}^{n \times n}$ is said to be positive definite if $x^\top A x > 0$ for all $x \in \mathbb{R}^n$, $x \neq 0$.
Remark 2.2
Equivalently, positive definiteness means that all the eigenvalues of $A$ are positive. In the same way we define positive semidefinite matrices: a symmetric square matrix $A$ is said to be positive semidefinite if $x^\top A x \ge 0$ for all $x \in \mathbb{R}^n$. To denote positive definite and positive semidefinite matrices we use the symbols $A \succ 0$ and $A \succeq 0$, respectively. If a symmetric matrix has both positive and negative eigenvalues, it is said to be indefinite.
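The eigenvalue characterization above lends itself to a simple numerical check. The following sketch (the function name and test matrices are illustrative, not from the text) classifies a symmetric matrix by the signs of its eigenvalues:

```python
import numpy as np

def classify_symmetric(A, tol=1e-12):
    """Classify a symmetric matrix by the signs of its eigenvalues."""
    eig = np.linalg.eigvalsh(A)  # eigenvalues of a symmetric matrix, ascending order
    if np.all(eig > tol):
        return "positive definite"
    if np.all(eig >= -tol):
        return "positive semidefinite"
    if np.any(eig > tol) and np.any(eig < -tol):
        return "indefinite"
    if np.all(eig < -tol):
        return "negative definite"
    return "negative semidefinite"

A = np.array([[2.0, -1.0], [-1.0, 2.0]])   # eigenvalues 1 and 3
B = np.array([[1.0, 0.0], [0.0, -1.0]])    # eigenvalues 1 and -1
print(classify_symmetric(A))  # positive definite
print(classify_symmetric(B))  # indefinite
```

Note that `eigvalsh` exploits symmetry and returns real eigenvalues, which is exactly what the definition requires.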
Definition 2.5: small o
For functions $\varphi, g$ we write $\varphi(x) = o(g(x))$ (as $x \to 0$) if and only if there exist a neighbourhood $N$ of $0$ and a function $\alpha$ with $\lim_{x \to 0} \alpha(x) = 0$ such that:
$$\varphi(x) = \alpha(x)\, g(x), \quad \forall x \in N.$$
Practically, it means that if $\varphi(x) = o(g(x))$ then $\varphi$ shrinks faster than $g$ as $x$ goes to zero. For example, $x^2 = o(x)$. Note that from the definition we also have that $\lim_{x \to 0} \varphi(x) / g(x) = 0$.
Definition 2.6: Matrix Norm
Given a vector norm $\|\cdot\|$ on $\mathbb{R}^n$, the induced matrix norm of $A \in \mathbb{R}^{n \times n}$ is defined as $\|A\| := \max_{\|x\| = 1} \|A x\|$.
We shall consider the following NLP problem:
Definition 2.7: Nonlinear Program
$$\min_{x} \; f(x) \quad \text{s.t.} \quad g_i(x) \le 0, \; i = 1, \dots, m; \quad h_j(x) = 0, \; j = 1, \dots, p; \quad x \in X,$$
where $X$ is a subset of $\mathbb{R}^n$, $x$ is a vector of $n$ components $x_1, \dots, x_n$, and $f$, $g_1, \dots, g_m$ and $h_1, \dots, h_p$ are defined on $X$.
Remark 2.3
The function $f$ is usually called the objective function, criterion function, or performance index. Each of the constraints $g_i(x) \le 0$, $i = 1, \dots, m$, is called an inequality constraint, and each of the constraints $h_j(x) = 0$, $j = 1, \dots, p$, is called an equality constraint.
Remark 2.4
Note also that the set $X$ typically includes lower and upper bounds on the variables. The reason for separating variable bounds from the other inequality constraints is that they can play a useful role in some algorithms, i.e., they are handled in a specific way.
Remark 2.5
A vector $x \in X$ satisfying all the constraints is called a feasible solution to the problem. The collection of all such points forms the feasible region. The NLP problem, then, is to find a feasible point $x^*$ such that $f(x^*) \le f(x)$ for each feasible point $x$.
Needless to say, an NLP problem can be stated as a maximization problem, and the inequality constraints can be written in the form $g_i(x) \ge 0$.
Example 2.1. Consider the following problem:
The objective function and the three inequality constraints are:
Figure 2.3 illustrates the feasible region. The problem, then, is to find a point in the feasible region with the smallest possible value of . Note that points with are circles with radius and centre . This circle is called the contour of the objective function having the value . In order to minimize , we must find the circle with the smallest radius that intersects the feasible region. As shown in figure 2.3 , the smallest circle corresponds to and intersects the feasible region at the point . Hence, the optimal solution occurs at the point and has an objective value equal to .
The graphical approach used in the above example, i.e., finding an optimal solution by determining the objective contour with the smallest objective value that intersects the feasible region, is only suitable for small problems. It becomes intractable for problems in more than three variables, as well as for problems having complicated objective and/or constraint functions.
This chapter is organized as follows:
A variety of different definitions of optimality are used in different contexts. It is important to understand fully each definition and the context within which it is appropriately used.
Definition 2.8: Infimum/Supremum
The infimum of a nonempty set $S \subseteq \mathbb{R}$, denoted as $\inf S$, provided it exists, is the greatest lower bound for $S$, i.e., a number $\alpha$ satisfying:
(i) $z \ge \alpha$, $\forall z \in S$;
(ii) $\forall \bar{\alpha} > \alpha$, $\exists z \in S$ such that $z < \bar{\alpha}$.
Similarly, the supremum of $S$, denoted as $\sup S$, provided it exists, is the least upper bound for $S$, i.e., a number $\alpha$ satisfying:
(i) $z \le \alpha$, $\forall z \in S$;
(ii) $\forall \bar{\alpha} < \alpha$, $\exists z \in S$ such that $z > \bar{\alpha}$.
The first question one may ask concerns the existence of infima and suprema in $\mathbb{R}$. In particular, one cannot prove that in $\mathbb{R}$ every set bounded from above has a supremum and every set bounded from below has an infimum. This is an axiom, known as the axiom of completeness:
Axiom 2.1: of Completeness
If a nonempty subset of real numbers has an upper bound, then it has a least upper bound. If a nonempty subset of real numbers has a lower bound, it has a greatest lower bound.
Remark 2.6
It is important to note that the real number $\inf S$ (respectively $\sup S$), with $S$ a nonempty set in $\mathbb{R}$ bounded from below (respectively from above), although it exists, need not be an element of $S$.
Notation 2.1
Let $f(S) := \{f(x) : x \in S\}$ be the image of the feasible set $S$ of an optimization problem under the objective function $f$. Then, the notation
$$\inf_{x \in S} f(x)$$
refers to the number $\inf f(S)$. Likewise, the notation
$$\sup_{x \in S} f(x)$$
refers to $\sup f(S)$.
Clearly, the numbers $\inf_{x \in S} f(x)$ and $\sup_{x \in S} f(x)$ may not be attained by the value $f(x)$ at any $x \in S$. This is illustrated in an example below.
Remark 2.7
By convention, the infimum of an empty set is $+\infty$, while the supremum of an empty set is $-\infty$. That is, if the values $\pm\infty$ are allowed, then infima and suprema always exist.
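A minimal numerical sketch of an infimum that is not attained (the function and set here are assumed for illustration, not taken from the text): for $f(x) = x$ on the open interval $(0, 1)$, the infimum is $0$, yet no point of the set achieves it.

```python
# Assumed example: f(x) = x on the open interval (0, 1).
# inf f over (0, 1) is 0, but no point of (0, 1) attains it.
f = lambda x: x

# Sample points approaching the (excluded) left endpoint from inside the set.
samples = [10.0 ** (-k) for k in range(1, 16)]   # all lie in (0, 1)
values = [f(x) for x in samples]

inf_estimate = min(values)
print(inf_estimate)                 # tiny but strictly positive: the bound 0 is never attained
print(all(v > 0 for v in values))   # True
```

The sampled minimum keeps shrinking toward the infimum $0$ without ever reaching it, which is exactly the distinction between infimum and minimum made below.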
Consider the standard problem formulation
$$\min_{x \in S} f(x),$$
where $S \subseteq \mathbb{R}^n$ denotes the feasible set. Any $x \in S$ is said to be a feasible point. Conversely, any $x \in \mathbb{R}^n \setminus S$ is said to be an infeasible point.
Definition 2.9: (Global) Minimum, Strict (Global) Minimum
A point $x^* \in S$ is said to be a (global) minimum of $f$ on $S$ if
$$f(x^*) \le f(x), \quad \forall x \in S, \tag{2.1}$$
i.e., a minimum is a feasible point whose objective function value is less than or equal to the objective function value of all other feasible points. It is said to be a strict (global) minimum of $f$ on $S$ if
$$f(x^*) < f(x), \quad \forall x \in S, \; x \neq x^*.$$
A (global) maximum is defined by reversing the inequality in the above definition:
Definition 2.10: (Global) Maximum, Strict (Global) Maximum
A point $x^* \in S$ is said to be a (global) maximum of $f$ on $S$ if
$$f(x^*) \ge f(x), \quad \forall x \in S. \tag{2.2}$$
It is said to be a strict (global) maximum of $f$ on $S$ if
$$f(x^*) > f(x), \quad \forall x \in S, \; x \neq x^*.$$
Remark 2.8
The important distinction between minimum (respectively maximum) and infimum (respectively supremum) is that the value $f(x^*)$ must be attained at one or more points $x^* \in S$, whereas the value $\inf_{x \in S} f(x)$ (respectively $\sup_{x \in S} f(x)$) does not necessarily have to be attained at any point $x \in S$. Yet, if a minimum (respectively maximum) exists, then its optimal value will equal the infimum (respectively supremum).
Remark 2.9
Note also that if a minimum exists, it is not necessarily unique. That is, there may be multiple or even an infinite number of feasible points that satisfy the inequality 2.1 and are thus minima.
Notation 2.2
Since there is in general a set of points that are minima, the notation
$$\arg\min_{x \in S} f(x) := \{x \in S : f(x) = \inf_{x' \in S} f(x')\}$$
is introduced to denote the set of minima. This is a (possibly empty) set in $S$. (The notation $x^* = \arg\min_{x \in S} f(x)$ is also used by some authors; in this case, $\arg\min$ should be understood as a function returning a point that minimizes $f$ on $S$.)
Remark 2.10
A minimum $x^*$ is often referred to as an optimal solution, a global optimal solution, or simply a solution of the optimization problem. The real number $f(x^*)$ is known as the (global) optimal value or optimal solution value.
Notation 2.3
Regardless of the number of minima, there is always a unique real number that is the optimal value (if it exists). The notation $\min_{x \in S} f(x)$ is used to refer to this real value.
Unless the objective function and the feasible set possess special properties (e.g. convexity), it is usually very hard to devise algorithms that are capable of locating or estimating a global minimum or a global maximum with certainty. This motivates the definition of local minima and maxima, which, by the nature of their definition in terms of local information, are much more convenient to locate with an algorithm.
Definition 2.11: Local Minimum, Strict Local Minimum
A point $x^* \in S$ is said to be a local minimum of $f$ on $S$ if
$$\exists \varepsilon > 0 \text{ such that } f(x^*) \le f(x), \quad \forall x \in B_\varepsilon(x^*) \cap S.$$
A point $x^* \in S$ is said to be a strict local minimum of $f$ on $S$ if
$$\exists \varepsilon > 0 \text{ such that } f(x^*) < f(x), \quad \forall x \in B_\varepsilon(x^*) \cap S, \; x \neq x^*.$$
The qualifier 'local' originates from the requirement that $x^*$ be a minimum only for those feasible points in a neighbourhood $B_\varepsilon(x^*)$ (the open ball of radius $\varepsilon$ centred at $x^*$) around the local minimum.
Remark 2.11
Trivially, the property of $x^*$ being a global minimum implies that $x^*$ is also a local minimum, because a global minimum is a local minimum with the neighbourhood $B_\varepsilon(x^*)$ taken arbitrarily large.
Definition 2.12: Local Maximum, Strict Local Maximum
A point $x^* \in S$ is said to be a local maximum of $f$ on $S$ if
$$\exists \varepsilon > 0 \text{ such that } f(x^*) \ge f(x), \quad \forall x \in B_\varepsilon(x^*) \cap S.$$
A point $x^* \in S$ is said to be a strict local maximum of $f$ on $S$ if
$$\exists \varepsilon > 0 \text{ such that } f(x^*) > f(x), \quad \forall x \in B_\varepsilon(x^*) \cap S, \; x \neq x^*.$$
The qualifier 'local' originates from the requirement that $x^*$ be a maximum only for those feasible points in a neighbourhood $B_\varepsilon(x^*)$ around the local maximum.
Remark 2.12
Trivially, the property of $x^*$ being a global maximum implies that $x^*$ is also a local maximum, because a global maximum is a local maximum with the neighbourhood $B_\varepsilon(x^*)$ taken arbitrarily large.
Remark 2.13
It is important to note that the concept of a global minimum or a global maximum of a function on a set is defined without the notion of a distance (or a norm in the case of a vector space). In contrast, the definition of a local minimum or a local maximum requires that a distance be specified on the set of interest. In $\mathbb{R}^n$, norms are equivalent, and it is readily shown that local minima (respectively maxima) in $(\mathbb{R}^n, \|\cdot\|_\alpha)$ match local minima (respectively maxima) in $(\mathbb{R}^n, \|\cdot\|_\beta)$, for any two arbitrary norms $\|\cdot\|_\alpha$ and $\|\cdot\|_\beta$ in $\mathbb{R}^n$ (e.g., the Euclidean norm $\|\cdot\|_2$ and the infinity norm $\|\cdot\|_\infty$). Yet, this nice property does not hold in linear functional spaces, such as those encountered in problems of the calculus of variations and optimal control.
Figure ?? illustrates the various definitions of minima and maxima. Point is the unique global maximum; the objective value at this point is also the supremum. Points , , and are strict local minima because there exists a neighbourhood around each of these points for which , , or is the unique minimum (on the intersection of this neighbourhood with the feasible set ). Likewise, point is a strict local maximum. Point is the unique global minimum; the objective value at this point is also the infimum. Finally, point is simultaneously a local minimum and a local maximum because there are neighbourhoods for which the objective function remains constant over the entire neighbourhood; it is neither a strict local minimum, nor a strict local maximum.
Example 2.4. Consider the function
The point is a local minimum for
with value . The optimal value is and .
A crucial question when it comes to optimizing a function on a given set is whether a minimizer or a maximizer exists for that function in that set. Strictly, a minimum or maximum should only be referred to when it is known to exist.
Figure ?? illustrates three instances where a minimum does not exist. In figure ??a, the infimum of $f$ over $S$ exists, but since $S$ is not closed the infimum is not attained and a minimum does not exist. In figure ??b, the infimum of $f$ over $S$ is given by the limit of $f(x)$ as $x$ approaches the discontinuity point from the left; however, because $f$ is discontinuous at that point, a minimizing solution does not exist. Finally, figure ??c illustrates a situation in which $f$ is unbounded over the unbounded set $S$.
We now formally state and prove the result that if $S$ is nonempty, closed, and bounded, and if $f$ is continuous on $S$, then, unlike the various situations of figure ??, a minimum exists.
Theorem 2.1: Weierstrass’ Theorem
Let $S \subseteq \mathbb{R}^n$ be a nonempty compact set and let $f: S \to \mathbb{R}$ be continuous on $S$. Then, the problem $\min\{f(x) : x \in S\}$ attains its minimum, that is, there exists a minimizing solution to this problem.
Proof. Since $f$ is continuous on $S$ and $S$ is both closed and bounded, $f$ is bounded below on $S$. Consequently, since $S \neq \emptyset$, there exists a greatest lower bound $\alpha := \inf\{f(x) : x \in S\}$ (see previous axiom). Now, let $0 < \varepsilon < 1$, and consider the set $S_k := \{x \in S : \alpha \le f(x) \le \alpha + \varepsilon^k\}$ for $k = 1, 2, \dots$. By the definition of an infimum, $S_k \neq \emptyset$ for each $k$, and so we may construct a sequence of points $\{x_k\} \subseteq S$ by selecting a point $x_k \in S_k$ for each $k$. Since $S$ is bounded, there exists a convergent subsequence $\{x_k\}_K$ indexed by the set $K \subseteq \mathbb{N}$. Let $\bar{x}$ denote its limit. By the closedness of $S$, we have $\bar{x} \in S$, and by the continuity of $f$ on $S$, since $\alpha \le f(x_k) \le \alpha + \varepsilon^k$, we have $f(\bar{x}) = \lim_{k \in K, k \to \infty} f(x_k) = \alpha$. Hence, we have shown that there exists a solution $\bar{x} \in S$ such that $f(\bar{x}) = \alpha = \inf\{f(x) : x \in S\}$, i.e., $\bar{x}$ is a minimizing solution. □
Remark 2.14
The hypotheses of theorem 2.1 can be justified as follows:
Example 2.5. Theorem 2.1 establishes that a minimum (and a maximum) of
exists, since is a nonempty compact set and is a continuous function on . On the other hand, minima can still exist even though the set is not compact or the function is not continuous, for theorem 2.1 only provides a sufficient condition. This is the case for the problem
which has a minimum at .
Example 2.6. Consider the NLP problem
Since the objective function is continuous and the feasible region is nonempty, closed, and bounded, the existence of a minimum to this problem directly follows from theorem 2.1.
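Weierstrass' theorem can be illustrated numerically on an assumed one-dimensional problem (the function and interval below are illustrative, not the example from the text): a continuous function over a compact interval always attains its minimum, which a fine grid search can approximate.

```python
import numpy as np

# Assumed illustrative problem: minimize the continuous function
# f(x) = x**4 - 3*x**2 + x over the compact set S = [-2, 2].
# Weierstrass' theorem guarantees a minimizer exists in S.
f = lambda x: x**4 - 3.0 * x**2 + x

xs = np.linspace(-2.0, 2.0, 400001)   # fine grid over the compact set
k = np.argmin(f(xs))
x_star, f_star = xs[k], f(xs[k])
print(x_star, f_star)                 # minimizer near x = -1.3, value near -3.5
```

The grid search is only an approximation scheme; the theorem is what guarantees that the value being approximated is actually attained by some point of the set.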
A particular class of nonlinear programs is that of convex programs:
Definition 2.13: Convex Program
Let $S$ be a nonempty convex set in $\mathbb{R}^n$, and let $f: S \to \mathbb{R}$ be convex on $S$. Then,
$$\min_{x \in S} f(x)$$
is said to be a convex program (or a convex optimization problem).
Convex programs possess nicer theoretical properties than general nonconvex problems. Theorem 2.2 is a fundamental result in convex programming:
Theorem 2.2
Let $x^*$ be a local minimum of a convex program. Then, $x^*$ is also a global minimum.
Proof. $x^*$ being a local minimum, there exists $\varepsilon > 0$ such that
$$f(x^*) \le f(x), \quad \forall x \in B_\varepsilon(x^*) \cap S.$$
By contradiction, suppose that $x^*$ is not a global minimum. Then,
$$\exists \bar{x} \in S \text{ such that } f(\bar{x}) < f(x^*). \tag{2.3}$$
Let $\lambda \in (0, 1)$ be chosen such that $x := \lambda \bar{x} + (1 - \lambda) x^* \in B_\varepsilon(x^*)$. By convexity of $S$, $x$ is in $S$. Next, by convexity of $f$ on $S$ and 2.3,
$$f(x) \le \lambda f(\bar{x}) + (1 - \lambda) f(x^*) < \lambda f(x^*) + (1 - \lambda) f(x^*) = f(x^*),$$
hence contradicting the assumption that $x^*$ is a local minimum. □
Example 2.7. Consider once again the NLP problem
The objective function and the inequality constraints being convex, every local solution to this problem is also a global solution by theorem 2.2. Hence, the solution found above is a global solution, and its objective value is the global solution value.
In convex programming, any local minimum is therefore a global minimum. This is a powerful result that makes any local optimization algorithm a global optimization algorithm when applied to a convex optimization problem. Yet, theorem 2.2 only gives a sufficient condition for that property to hold. That is, a nonlinear program with nonconvex participating functions may still have the property that every local minimum is a global minimum.
Definition 2.14: Unconstrained Programs
An unconstrained program is a problem of the form to minimize (or maximize) $f(x)$ without any constraints on the variables $x$:
$$\min_{x \in \mathbb{R}^n} f(x).$$
Remark 2.15
Note that, the feasible domain of this problem being $\mathbb{R}^n$, which is unbounded, theorem 2.1 does not apply. Thus, one does not know with certainty whether a minimum actually exists for that problem. (For unconstrained optimization problems, the existence of a minimum can actually be guaranteed if the objective function is such that $f(x) \to +\infty$ as $\|x\| \to \infty$, a so-called 0-coercive function.) Moreover, even if the objective function is convex, such a minimum may not exist (think of $f(x) = e^x$!). Hence, we shall proceed with the theoretically unattractive task of seeking minima and maxima of functions which need not have them!
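The contrast between a non-coercive and a coercive objective can be sketched numerically (both functions below are assumed textbook examples, not taken from the text):

```python
import math

# A convex but non-coercive function such as f(x) = exp(x) has no minimum on R:
# its infimum over R is 0, approached as x -> -infinity but never attained.
f = math.exp
print([f(-k) for k in (1, 10, 100)])   # values shrink toward 0, all strictly positive

# A coercive function (f(x) -> +inf as |x| -> inf), e.g. g(x) = x**2 - 2*x,
# does attain its minimum (here at x = 1 with value -1).
g = lambda x: x**2 - 2.0 * x
xs = [i / 1000.0 for i in range(-5000, 5001)]
g_min = min(g(x) for x in xs)
print(g_min)                           # -1.0, attained at x = 1.0
```

Coercivity replaces the compactness hypothesis of Weierstrass' theorem: the growth of the objective effectively confines the search to a bounded sublevel set.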
Given a point $\bar{x} \in \mathbb{R}^n$, necessary conditions help determine whether or not $\bar{x}$ is a local or a global minimum of a function $f$. For this purpose, we are mostly interested in obtaining conditions that can be checked algebraically.
Definition 2.15: Descent Direction
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is continuous at $\bar{x}$. A vector $d \in \mathbb{R}^n$ is said to be a descent direction, or an improving direction, for $f$ at $\bar{x}$ if
$$\exists \delta > 0 \text{ such that } f(\bar{x} + \lambda d) < f(\bar{x}), \quad \forall \lambda \in (0, \delta).$$
Moreover, the cone of descent directions at $\bar{x}$, denoted by $F(\bar{x})$, is given by
$$F(\bar{x}) := \{d : \exists \delta > 0 \text{ such that } f(\bar{x} + \lambda d) < f(\bar{x}), \; \forall \lambda \in (0, \delta)\}.$$
Definition 2.15 provides a geometrical characterization of a descent direction. Yet, an algebraic characterization would be more useful from a practical point of view. In response to this, let us assume that $f$ is differentiable and define the following set at $\bar{x}$:
$$F_0(\bar{x}) := \{d : \nabla f(\bar{x})^\top d < 0\}.$$
This is illustrated in figure ??, where the half-space $F_0(\bar{x})$ and the gradient $\nabla f(\bar{x})$ are translated from the origin to $\bar{x}$ for convenience.
Lemma 2.1 proves that every element $d \in F_0(\bar{x})$ is a descent direction at $\bar{x}$.
Lemma 2.1: Algebraic Characterization of a Descent Direction
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $\bar{x}$. If there exists a vector $d$ such that $\nabla f(\bar{x})^\top d < 0$, then $d$ is a descent direction for $f$ at $\bar{x}$. That is,
$$F_0(\bar{x}) \subseteq F(\bar{x}).$$
Proof. $f$ being differentiable at $\bar{x}$,
$$f(\bar{x} + \lambda d) = f(\bar{x}) + \lambda \nabla f(\bar{x})^\top d + \lambda \|d\| \, \alpha(\bar{x}, \lambda d),$$
where $\lim_{\lambda \to 0} \alpha(\bar{x}, \lambda d) = 0$. Rearranging the terms and dividing by $\lambda \neq 0$, we get
$$\frac{f(\bar{x} + \lambda d) - f(\bar{x})}{\lambda} = \nabla f(\bar{x})^\top d + \|d\| \, \alpha(\bar{x}, \lambda d).$$
Since $\nabla f(\bar{x})^\top d < 0$ and $\alpha(\bar{x}, \lambda d) \to 0$ as $\lambda \to 0$, there exists a $\delta > 0$ such that $\nabla f(\bar{x})^\top d + \|d\| \, \alpha(\bar{x}, \lambda d) < 0$ for all $\lambda \in (0, \delta)$. □
We are now ready to derive a number of necessary conditions for a point to be a local minimum of an unconstrained optimization problem.
Theorem 2.3: First-Order Necessary Condition for a Local Minimum
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $\bar{x}$. If $\bar{x}$ is a local minimum, then $\nabla f(\bar{x}) = 0$.
Proof. The proof proceeds by contraposition. Suppose that $\nabla f(\bar{x}) \neq 0$. Then, letting $d := -\nabla f(\bar{x})$, we get $\nabla f(\bar{x})^\top d = -\|\nabla f(\bar{x})\|^2 < 0$. By lemma 2.1,
$$\exists \delta > 0 \text{ such that } f(\bar{x} + \lambda d) < f(\bar{x}), \quad \forall \lambda \in (0, \delta),$$
hence contradicting the assumption that $\bar{x}$ is a local minimum for $f$. □
Remark 2.16: Obtaining Candidate Solution Points
The above condition is called a first-order necessary condition because it uses the first-order derivatives of $f$. This condition indicates that the candidate solutions to an unconstrained optimization problem can be found by solving the system of algebraic (nonlinear) equations $\nabla f(x) = 0$. Points $\bar{x}$ such that $\nabla f(\bar{x}) = 0$ are known as stationary points. Yet, a stationary point need not be a local minimum; it could very well be a local maximum or even a saddle point.
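In practice, the stationarity system $\nabla f(x) = 0$ is often solved numerically, for instance with Newton's method applied to the gradient. The sketch below uses an assumed quadratic objective (not an example from the text), for which the gradient and Hessian are written out by hand:

```python
import numpy as np

# Sketch: locate a stationary point of the assumed function
# f(x, y) = (x - 1)**2 + 2*(y + 0.5)**2 by solving grad f = 0 with Newton's method.
def grad(z):
    x, y = z
    return np.array([2.0 * (x - 1.0), 4.0 * (y + 0.5)])

def hess(z):
    # Constant Hessian for this quadratic objective.
    return np.array([[2.0, 0.0], [0.0, 4.0]])

z = np.array([5.0, 5.0])            # arbitrary starting guess
for _ in range(20):                 # Newton iteration on the gradient system
    z = z - np.linalg.solve(hess(z), grad(z))

print(z)                            # approx [1.0, -0.5], the unique stationary point
```

For a quadratic objective a single Newton step already lands on the stationary point; for a general nonlinear $f$ the iteration only converges locally, and each root found must still be classified (minimum, maximum, or saddle) using second-order information.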
Example 2.8. Consider the problem
The gradient vector of the objective function is given by
which has three distinct roots. Out of these values, one gives the smallest cost value. Yet, we cannot declare that point to be the global minimum, because we do not know whether a (global) minimum exists for this problem. Indeed, as shown in figure 2.4, none of the stationary points is a global minimum because $f$ decreases without bound away from them.
More restrictive necessary conditions can also be derived in terms of the Hessian matrix whose elements are the second-order derivatives of . One such second-order condition is given below.
Theorem 2.4: Second-Order Necessary Conditions for a Local Minimum
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $\bar{x}$. If $\bar{x}$ is a local minimum, then $\nabla f(\bar{x}) = 0$ and the Hessian matrix $H(\bar{x})$ is positive semidefinite.
Proof. Consider an arbitrary direction $d$. Then, from the differentiability of $f$ at $\bar{x}$, we have
$$f(\bar{x} + \lambda d) = f(\bar{x}) + \lambda \nabla f(\bar{x})^\top d + \frac{\lambda^2}{2} d^\top H(\bar{x}) d + \lambda^2 \|d\|^2 \alpha(\bar{x}, \lambda d),$$
where $\lim_{\lambda \to 0} \alpha(\bar{x}, \lambda d) = 0$. Since $\bar{x}$ is a local minimum, from Theorem 2.3, $\nabla f(\bar{x}) = 0$. Rearranging the terms and dividing by $\lambda^2 \neq 0$, we get
$$\frac{f(\bar{x} + \lambda d) - f(\bar{x})}{\lambda^2} = \frac{1}{2} d^\top H(\bar{x}) d + \|d\|^2 \alpha(\bar{x}, \lambda d).$$
Since $\bar{x}$ is a local minimum, $f(\bar{x} + \lambda d) \ge f(\bar{x})$ for $\lambda$ sufficiently small. By taking the limit as $\lambda \to 0$, it follows that $d^\top H(\bar{x}) d \ge 0$. Since $d$ is arbitrary, $H(\bar{x})$ is therefore positive semidefinite. □
Example 2.9. Consider the problem
The gradient vector of the objective function is given by
so that the only stationary point in is . Now, consider the Hessian matrix of the objective function at :
It is easily checked that the Hessian is indefinite. Therefore, by Theorem 2.4, the stationary point is not a (local) minimum (nor is it a local maximum). Such stationary points are called saddle points (see Figure 2.5).
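Classifying a stationary point via the eigenvalues of the Hessian is easy to automate. The sketch below uses the assumed textbook saddle $f(x, y) = x^2 - y^2$ (not necessarily the example above), whose constant Hessian has eigenvalues of both signs:

```python
import numpy as np

# Assumed example: f(x, y) = x**2 - y**2 has the unique stationary point (0, 0).
# Its (constant) Hessian has eigenvalues of both signs, so (0, 0) is a saddle point.
H = np.array([[2.0, 0.0], [0.0, -2.0]])
eig = np.linalg.eigvalsh(H)          # eigenvalues in ascending order
is_saddle = eig[0] < 0 < eig[-1]     # indefinite Hessian at a stationary point
print(eig, is_saddle)
```

A strictly positive spectrum would instead certify a strict local minimum (Theorem 2.8), and a strictly negative one a strict local maximum.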
The conditions presented in theorems 2.3 and 2.4 are necessary conditions. That is, they must hold true at every local optimal solution. Yet, a point satisfying these conditions need not be a local minimum. Theorem 2.5 gives sufficient conditions for a stationary point to be a global minimum point, provided the objective function is convex on .
Theorem 2.5: First-Order Sufficient Conditions for a Global Minimum for Convex Functions
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is differentiable at $\bar{x}$ and convex on $\mathbb{R}^n$. If $\nabla f(\bar{x}) = 0$, then $\bar{x}$ is a global minimum of $f$ on $\mathbb{R}^n$.
Proof. $f$ being convex on $\mathbb{R}^n$ and differentiable at $\bar{x}$, by Theorem 2.6, we have
$$f(x) \ge f(\bar{x}) + \nabla f(\bar{x})^\top (x - \bar{x}), \quad \forall x \in \mathbb{R}^n.$$
But since $\bar{x}$ is a stationary point, $\nabla f(\bar{x}) = 0$, and hence $f(x) \ge f(\bar{x})$ for all $x \in \mathbb{R}^n$. □
Theorem 2.6: First-Order Condition of Convexity
Let $C$ be a convex set in $\mathbb{R}^n$ with a nonempty interior, and let $f: C \to \mathbb{R}$ be a function. Suppose $f$ is continuous on $C$ and differentiable on the interior of $C$. Then $f$ is convex on $C$ if and only if
$$f(x_2) \ge f(x_1) + \nabla f(x_1)^\top (x_2 - x_1)$$
holds for any two points $x_1, x_2 \in C$.
Proof. ... □
Theorem 2.7: Second-Order Condition of Convexity
Let $C$ be a convex set in $\mathbb{R}^n$ with a nonempty interior, and let $f: C \to \mathbb{R}$ be a function. Suppose $f$ is continuous on $C$ and twice differentiable on the interior of $C$. Then $f$ is convex on $C$ if and only if its Hessian matrix $H(x)$ is positive semidefinite at each point $x$ in the interior of $C$.
Proof. ... □
Remark 2.17
The first-order condition of strict convexity is:
$$f(x_2) > f(x_1) + \nabla f(x_1)^\top (x_2 - x_1), \quad \forall x_1, x_2 \in C, \; x_1 \neq x_2.$$
A second-order (sufficient) condition is that the Hessian matrix $H(x)$ be positive definite at each point $x$ in the interior of $C$.
The convexity condition required by theorem 2.5 is actually very restrictive, in the sense that many practical problems are nonconvex. In theorem 2.8, we give sufficient conditions for characterizing a local minimum point, provided the objective function is strictly convex in a neighborhood of that point.
Theorem 2.8: Second-Order Sufficient Conditions for a Strict Local Minimum
Suppose that $f: \mathbb{R}^n \to \mathbb{R}$ is twice differentiable at $\bar{x}$. If $\nabla f(\bar{x}) = 0$ and $H(\bar{x})$ is positive definite, then $\bar{x}$ is a strict local minimum of $f$.
Proof. $f$ being twice differentiable at $\bar{x}$, we have
$$f(\bar{x} + d) = f(\bar{x}) + \nabla f(\bar{x})^\top d + \frac{1}{2} d^\top H(\bar{x}) d + \|d\|^2 \alpha(\bar{x}, d),$$
for each $d \in \mathbb{R}^n$, where $\lim_{d \to 0} \alpha(\bar{x}, d) = 0$. Let $\lambda_{\min}$ denote the smallest eigenvalue of $H(\bar{x})$. Then, $H(\bar{x})$ being positive definite, we have $\lambda_{\min} > 0$ and $d^\top H(\bar{x}) d \ge \lambda_{\min} \|d\|^2$. Moreover, from $\nabla f(\bar{x}) = 0$, we get
$$f(\bar{x} + d) - f(\bar{x}) \ge \left( \frac{\lambda_{\min}}{2} + \alpha(\bar{x}, d) \right) \|d\|^2.$$
Since $\alpha(\bar{x}, d) \to 0$ as $d \to 0$, there exists $\eta > 0$ such that $|\alpha(\bar{x}, d)| < \frac{\lambda_{\min}}{4}$ for all $d$ with $0 < \|d\| < \eta$, and finally,
$$f(\bar{x} + d) - f(\bar{x}) \ge \frac{\lambda_{\min}}{4} \|d\|^2 > 0, \quad \forall d \text{ with } 0 < \|d\| < \eta,$$
i.e., $\bar{x}$ is a strict local minimum of $f$. □
Example 2.10. Consider the problem
The gradient vector and Hessian matrix at are given by
Note that the gradient vanishes at this point and that the Hessian is positive definite, which is the second-order characterization of strict convexity. Hence, by Theorem 2.8, this point is a local minimum of $f$ (it is also a global minimum of $f$ on $\mathbb{R}^n$ since $f$ is convex). The objective function is pictured in Figure 2.6 below.
We close this subsection by re-emphasizing the fact that every local minimum of an unconstrained optimization problem $\min_{x \in \mathbb{R}^n} f(x)$ is a global minimum if $f$ is a convex function on $\mathbb{R}^n$. Yet, convexity of $f$ is not a necessary condition for each local minimum to be a global minimum.
In practice, few problems can be formulated as unconstrained programs. This is because the feasible region is generally restricted by imposing constraints on the optimization variables. In this section, we first present theoretical results for the problem of minimizing $f(x)$ subject to $x \in S$ for a general set $S$ (geometric optimality conditions). Then, we let $S$ be defined more specifically by a set of inequality constraints:
Definition 2.16: Inequality Constrained Program
The feasible region of the NLP is defined by a set of nonlinear inequalities:
$$S := \{x \in \mathbb{R}^n : g_i(x) \le 0, \; i = 1, \dots, m\}.$$
For this inequality constrained program, we derive the Karush-Kuhn-Tucker (KKT) conditions of optimality in the following.
Definition 2.17: Feasible Direction
Let $S$ be a nonempty set in $\mathbb{R}^n$. A vector $d \in \mathbb{R}^n$, $d \neq 0$, is said to be a feasible direction at $\bar{x} \in S$ if
$$\exists \delta > 0 \text{ such that } \bar{x} + \lambda d \in S, \quad \forall \lambda \in (0, \delta).$$
Moreover, the cone of feasible directions at $\bar{x}$, denoted by $D(\bar{x})$, is given by
$$D(\bar{x}) := \{d \neq 0 : \exists \delta > 0 \text{ such that } \bar{x} + \lambda d \in S, \; \forall \lambda \in (0, \delta)\}.$$
From Definition 2.17 and Lemma 2.1, it is clear that a small movement from $\bar{x}$ along a direction $d \in D(\bar{x})$ leads to feasible points, whereas a similar movement along a direction $d \in F_0(\bar{x})$ leads to solutions of improving objective value (see Definition 2.15). As shown in theorem ??, a (geometric) necessary condition for local optimality is that "every improving direction is not a feasible direction". This fact is illustrated in figure ??, where both the half-space $F_0(\bar{x})$ and the cone $D(\bar{x})$ (see Definition 2.3) are translated from the origin to $\bar{x}$ for clarity.
Theorem 2.9: Geometric Necessary Condition for a Local Minimum
Let $S$ be a nonempty set in $\mathbb{R}^n$ and let $f: \mathbb{R}^n \to \mathbb{R}$ be a differentiable function. Suppose that $\bar{x}$ is a local minimizer of the problem to minimize $f(x)$ subject to $x \in S$. Then, $F_0(\bar{x}) \cap D(\bar{x}) = \emptyset$.
We now specify the feasible region as
$$S := \{x : g_i(x) \le 0, \; i = 1, \dots, m\},$$
where $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, are continuous functions. In the geometric optimality condition given by Theorem 2.9, $D(\bar{x})$ is the cone of feasible directions. From a practical viewpoint, it is desirable to convert this geometric condition into a more usable condition involving algebraic equations. As Lemma 2.2 indicates, we can define a cone $G_0(\bar{x})$ in terms of the gradients of the active constraints at $\bar{x}$, such that $G_0(\bar{x}) \subseteq D(\bar{x})$. For this, we need the following:
Definition 2.18: Active Constraint, Active Set
Let $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, and consider the set $S := \{x : g_i(x) \le 0, \; i = 1, \dots, m\}$. Let $\bar{x} \in S$ be a feasible point. For each $i = 1, \dots, m$, the constraint $g_i$ is said to be active or binding at $\bar{x}$ if $g_i(\bar{x}) = 0$; it is said to be inactive at $\bar{x}$ if $g_i(\bar{x}) < 0$. Moreover,
$$A(\bar{x}) := \{i : g_i(\bar{x}) = 0\}$$
denotes the set of active constraints at $\bar{x}$.
Lemma 2.2: Algebraic Characterization of a Feasible Direction
Let $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, be differentiable functions, and consider the set $S := \{x : g_i(x) \le 0, \; i = 1, \dots, m\}$. For any feasible point $\bar{x} \in S$, we have
$$G_0(\bar{x}) := \{d : \nabla g_i(\bar{x})^\top d < 0, \; \forall i \in A(\bar{x})\} \subseteq D(\bar{x}).$$
Remark 2.18
This lemma together with Theorem 2.9 directly leads to the result that $F_0(\bar{x}) \cap G_0(\bar{x}) = \emptyset$ for any local solution point $\bar{x}$. The foregoing geometric characterization of local solution points applies equally well to interior points (where all constraints are inactive) and to points at the boundary of the feasible domain. At an interior point, in particular, any direction is feasible, and the necessary condition reduces to $\nabla f(\bar{x}) = 0$, which gives the same condition as in unconstrained optimization (see Theorem 2.3).
Note also that there are several cases where the condition $F_0(\bar{x}) \cap G_0(\bar{x}) = \emptyset$ is satisfied by non-optimal points. In other words, this condition is necessary but not sufficient for a point $\bar{x}$ to be a local minimum of $f$ on $S$. For instance, any point $\bar{x}$ with $\nabla g_i(\bar{x}) = 0$ for some $i \in A(\bar{x})$ trivially satisfies the condition $G_0(\bar{x}) = \emptyset$. Another example is given below.
Example 2.11. Consider the problem
Clearly, this problem is convex and is the unique global minimum.
Now, let be any point on the line . Both constraints and are active at , and we have . Therefore, no direction can be found such that and simultaneously, i.e., . In turn, this implies that is trivially satisfied for any point on .
On the other hand, observe that the condition in Theorem 2.9 excludes all the points on , but the origin, since a feasible direction at is given, e.g., by .
We now reduce the geometric necessary optimality condition to a statement in terms of the gradients of the objective function and of the active constraints. The resulting first-order optimality conditions are known as the Karush-Kuhn-Tucker (KKT) necessary conditions. Beforehand, we introduce the important concepts of a regular point and of a KKT point.
Definition 2.19: Regular Point (for a Set of Inequality Constraints)
Let $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, be differentiable functions on $\mathbb{R}^n$ and consider the set $S := \{x : g_i(x) \le 0, \; i = 1, \dots, m\}$. A point $\bar{x} \in S$ is said to be a regular point if the gradient vectors $\nabla g_i(\bar{x})$, $i \in A(\bar{x})$, are linearly independent:
$$\operatorname{rank}\left( \nabla g_i(\bar{x}), \; i \in A(\bar{x}) \right) = |A(\bar{x})|.$$
Definition 2.20: KKT Point
Let $f: \mathbb{R}^n \to \mathbb{R}$ and $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, be differentiable functions. Consider the problem to minimize $f(x)$ subject to $g(x) \le 0$. If a pair $(\bar{x}, \bar{\nu}) \in \mathbb{R}^n \times \mathbb{R}^m$ satisfies the conditions:
$$\nabla f(\bar{x}) + \sum_{i=1}^{m} \bar{\nu}_i \nabla g_i(\bar{x}) = 0, \tag{2.4a}$$
$$\bar{\nu} \ge 0, \tag{2.4b}$$
$$g(\bar{x}) \le 0, \tag{2.4c}$$
$$\bar{\nu}_i \, g_i(\bar{x}) = 0, \quad i = 1, \dots, m, \tag{2.4d}$$
then $(\bar{x}, \bar{\nu})$ is said to be a KKT point.
Remark 2.19
The scalars $\bar{\nu}_i$, $i = 1, \dots, m$, are called the Lagrange multipliers. Condition 2.4a is called the stationarity condition. Condition 2.4b is referred to as the dual feasibility (DF) condition; condition 2.4c, i.e., the requirement that $\bar{x}$ be feasible, is called the primal feasibility (PF) condition; finally, condition 2.4d is called the complementary slackness (CS) condition.
Theorem 2.10: KKT Necessary Conditions
Let $f: \mathbb{R}^n \to \mathbb{R}$ and $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, be differentiable functions. Consider the problem to minimize $f(x)$ subject to $g(x) \le 0$. If $\bar{x}$ is a local minimum and a regular point of the constraints, then there exists a unique vector $\bar{\nu}$ such that $(\bar{x}, \bar{\nu})$ is a KKT point.
Proof. Since $\bar{x}$ solves the problem, there exists no direction $d$ such that $\nabla f(\bar{x})^\top d < 0$ and $\nabla g_i(\bar{x})^\top d < 0$, $i \in A(\bar{x})$, simultaneously (see Remark 2.18). Let $A$ be the matrix whose rows are $\nabla f(\bar{x})^\top$ and $\nabla g_i(\bar{x})^\top$, $i \in A(\bar{x})$. Clearly, the statement $\{\exists d : A d < 0\}$ is false and, by Corollary 2.1, there exists a nonzero vector $p \ge 0$ such that $A^\top p = 0$. Denoting the components of $p$ by $u_0$ and $u_i$ for $i \in A(\bar{x})$, we get:
$$u_0 \nabla f(\bar{x}) + \sum_{i \in A(\bar{x})} u_i \nabla g_i(\bar{x}) = 0,$$
where $u_0 \ge 0$ and $u_i \ge 0$ for $i \in A(\bar{x})$, with $(u_0, u_{A(\bar{x})}) \neq (0, 0)$ (here $u_{A(\bar{x})}$ is the vector whose components are the $u_i$'s for $i \in A(\bar{x})$). Letting $u_i := 0$ for $i \notin A(\bar{x})$, we then get the conditions:
$$u_0 \nabla f(\bar{x}) + \sum_{i=1}^{m} u_i \nabla g_i(\bar{x}) = 0, \quad u_i \, g_i(\bar{x}) = 0, \; i = 1, \dots, m, \quad (u_0, u) \ge 0,$$
where $u$ is the vector whose components are $u_i$ for $i = 1, \dots, m$. Note that $u_0 \neq 0$, since otherwise the assumption of linear independence of the active constraints at $\bar{x}$ would be violated. (Indeed, if $u_0 = 0$, then $\sum_{i \in A(\bar{x})} u_i \nabla g_i(\bar{x}) = 0$ with at least one $u_i > 0$, thus violating the linear independence assumption.) Then, letting $\bar{\nu}_i := u_i / u_0$, we obtain that $(\bar{x}, \bar{\nu})$ is a KKT point. □
Remark 2.20
One of the major difficulties in applying the foregoing result is that we do not know a priori which constraints are active and which constraints are inactive, i.e., the active set is unknown. Therefore, it is necessary to investigate all possible active sets for finding candidate points satisfying the KKT conditions. This is illustrated in the example below.
Example 2.12 (Regular Case). Consider the problem
Note that every feasible point is regular, so a local minimum must satisfy the stationarity conditions:
Four cases can be distinguished:
Overall, there is a unique candidate for a local minimum. Yet, it cannot be concluded as to whether this point is actually a global minimum or even a local minimum.
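The active-set enumeration just described can be sketched on a small hypothetical problem (the problem data below are assumptions for illustration, not the example above): minimize $x_1^2 + x_2^2$ subject to $g(x) = 1 - x_1 - x_2 \le 0$.

```python
import numpy as np

# Hypothetical problem to illustrate active-set enumeration:
#   minimize  x1**2 + x2**2   subject to  g(x) = 1 - x1 - x2 <= 0.
# Stationarity: 2*x + nu * grad g = 0 with grad g = (-1, -1), nu >= 0, nu*g(x) = 0.

candidates = []

# Case 1: constraint inactive (nu = 0) -> stationarity gives x = (0, 0);
# check primal feasibility g(x) <= 0.
x = np.zeros(2)
if 1.0 - x.sum() <= 0.0:
    candidates.append((x, 0.0))     # infeasible here: g(0) = 1 > 0, so not appended

# Case 2: constraint active (g(x) = 0) -> solve 2*x = nu * (1, 1) with x1 + x2 = 1.
# Symmetry gives x1 = x2 = 0.5 and nu = 2 * 0.5 = 1; check dual feasibility nu >= 0.
x = np.array([0.5, 0.5])
nu = 1.0
if nu >= 0.0:
    candidates.append((x, nu))

print(candidates)   # only the active case survives: x = (0.5, 0.5), nu = 1
```

As in the example above, each guess of the active set yields an algebraic system; candidates are kept only if they pass the remaining KKT conditions (primal and dual feasibility).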
Remark 2.21: Constraint Qualification
It is very important to note that, for a local minimum to be a KKT point, an additional condition must be placed on the behaviour of the constraints; i.e., not every local minimum is a KKT point. Such a condition is known as a constraint qualification. In Theorem 2.10, one possible constraint qualification is that $\bar{x}$ be a regular point, which is the well-known linear independence constraint qualification (LICQ). A weaker constraint qualification (i.e., implied by LICQ), known as the Mangasarian-Fromovitz constraint qualification (MFCQ), requires that there exist (at least) one direction $d \in G_0(\bar{x})$, i.e., such that $\nabla g_i(\bar{x})^\top d < 0$ for each $i \in A(\bar{x})$. Note, however, that the Lagrange multipliers are guaranteed to be unique if LICQ holds (as stated in Theorem 2.10), while this uniqueness property may be lost under MFCQ.
The following example illustrates the necessity of having a constraint qualification for a KKT point to be a local minimum point of an NLP.
Example 2.13 (Non-Regular Case). Consider the problem
The feasible region is shown in fig. 2.7 below. Note that a minimum point is . The stationary condition relative to variable computed at reads:
It is readily seen that this condition cannot be met at the local minimum point. In other words, the KKT conditions are not necessary in this example. This is because no constraint qualification can hold at that point. In particular, the minimum point not being a regular point, LICQ does not hold. Moreover, the set $G_0$ at that point being empty (any candidate direction induces a violation of one of the active constraints), MFCQ does not hold there either.
The following theorem provides a sufficient condition under which any KKT point of an inequality constrained NLP problem is guaranteed to be a global minimum of that problem.
Theorem 2.11: KKT Sufficient Conditions
Let $f: \mathbb{R}^n \to \mathbb{R}$ and $g_i: \mathbb{R}^n \to \mathbb{R}$, $i = 1, \dots, m$, be convex and differentiable functions. Consider the problem to minimize $f(x)$ subject to $g(x) \le 0$. If $(\bar{x}, \bar{\nu})$ is a KKT point, then $\bar{x}$ is a global minimum of that problem.
Proof. Consider the function $L(x) := f(x) + \sum_{i=1}^{m} \bar{\nu}_i g_i(x)$. Since $f$ and $g_i$, $i = 1, \dots, m$, are convex functions and $\bar{\nu}_i \ge 0$, $L$ is also convex (the sum of convex functions is a convex function, and a nonnegative scalar times a convex function is again a convex function). Moreover, the stationarity condition imposes that $\nabla L(\bar{x}) = 0$. Hence, by Theorem 2.5, $\bar{x}$ is a global minimizer for $L$ on $\mathbb{R}^n$, i.e.,
$$L(x) \ge L(\bar{x}), \quad \forall x \in \mathbb{R}^n.$$
In particular, for each $x$ such that $g_i(x) \le 0$, $i = 1, \dots, m$, using the complementary slackness condition $\bar{\nu}_i g_i(\bar{x}) = 0$ and $\bar{\nu}_i g_i(x) \le 0$, we have
$$f(x) \ge f(\bar{x}) - \sum_{i=1}^{m} \bar{\nu}_i g_i(x) \ge f(\bar{x}).$$
Noting that the set $\{x : g_i(x) \le 0, \; i = 1, \dots, m\}$ contains the feasible domain, we therefore showed that $\bar{x}$ is a global minimizer for the problem. □
Example 2.14. Consider the problem
The point with , and , being a KKT point, and both the objective function and the feasible set being convex, by Theorem 2.11 , is a global minimum.
Both second-order necessary and sufficient conditions for inequality constrained NLP problems will be presented later on in 2.8.
In this section, we shall consider nonlinear programming problems with equality constraints of the form:
Based on the material presented in 2.6, it is tempting to convert this problem into an inequality constrained problem, by replacing each equality constraint $h_j(x) = 0$ by two inequality constraints $h_j(x) \le 0$ and $-h_j(x) \le 0$. Given a feasible point $\bar{x}$, we would have $h_j(\bar{x}) = 0$, so both inequality constraints would be active at $\bar{x}$. Therefore, there could exist no direction $d$ such that $\nabla h_j(\bar{x})^\top d < 0$ and $-\nabla h_j(\bar{x})^\top d < 0$ simultaneously, i.e., $G_0(\bar{x}) = \emptyset$. In other words, the geometric conditions developed in the previous section for inequality constrained problems are satisfied by all feasible solutions and, hence, are not informative. A different approach must therefore be used to deal with equality constrained problems. After a number of preliminary results in 2.7.1, we shall describe the method of Lagrange multipliers for equality constrained problems in 2.7.2.
An equality constraint $h(x) = 0$ defines a set in $\mathbb{R}^n$, which can be seen as a hypersurface.
When considering $p$ equality constraints $h_j(x) = 0$, $j = 1, \dots, p$, their intersection forms a (possibly empty) set $S := \{x \in \mathbb{R}^n : h_j(x) = 0, \; j = 1, \dots, p\}$.
Throughout this section, we shall assume that the equality constraints $h_j$ are differentiable; that is, the set $S$ is said to be a differentiable manifold (or smooth manifold). Associated with a point on a differentiable manifold is the tangent set at that point. To formalize this notion, we start by defining curves on a manifold. A curve $\xi$ on a manifold $S$ is a continuous application $\xi: I \to S$, i.e., a family of points $\xi(t) \in S$ continuously parameterized by $t$ in an interval $I \subseteq \mathbb{R}$. A curve is said to pass through the point $\bar{x}$ if $\bar{x} = \xi(\bar{t})$ for some $\bar{t} \in I$; the derivative of a curve at $\bar{t}$, provided it exists, is defined as $\dot{\xi}(\bar{t}) := \lim_{h \to 0} \frac{\xi(\bar{t} + h) - \xi(\bar{t})}{h}$. A curve is said to be differentiable (or smooth) if a derivative exists for each $t \in I$.
Definition 2.21: Tangent Set
Let be a (differentiable) manifold in , and let . Consider the collection of all the continuously differentiable curves on passing through . Then, the collection of all the vectors tangent to these curves at is said to be the tangent set to at , denoted by .
If the constraints are regular, in the sense of Definition 2.22 below, then is (locally) of dimension , and constitutes a subspace of dimension , called the tangent space.
Definition 2.22: Regular Point (for a Set of Equality Constraints)
Let be differentiable functions on and consider the set . A point is said to be a regular point if the gradient vectors , are linearly independent, i.e.,
Lemma 2.3: Algebraic Characterization of a Tangent Space
Let be differentiable functions on and consider the set . At a regular point , the tangent space is such that
Proof. ... □
Recall that is the Jacobian of the constraints. Therefore, the tangent directions must be orthogonal to the gradient of each constraint at . Note also that defines the kernel of the Jacobian at , that is, the set . Since we have assumed that is a regular point, the Jacobian has full rank and, by the rank-nullity theorem, the dimension of the kernel is , which is the dimension of the tangent space.
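Numerically, a basis of the tangent space can be computed as the null space of the constraint Jacobian, e.g. via the SVD. The sketch below uses a hypothetical single constraint h(x) = x1² + x2² + x3² − 3 = 0 at the regular point (1, 1, 1); it is not one of the text's own examples:

```python
import numpy as np

# Jacobian of the hypothetical constraint at the point (one row per constraint)
A = np.array([[2.0, 2.0, 2.0]])

_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))   # = m at a regular point
Z = Vt[rank:].T                 # columns of Z span ker(A), i.e. the tangent space
# dim ker(A) = n - m = 3 - 1 = 2, consistent with the rank-nullity theorem
```

Every tangent direction is a linear combination of the columns of `Z`, and each column is orthogonal to the constraint gradient.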
The idea behind the method of Lagrange multipliers for solving equality constrained NLP problems of the form
is to restrict the search of a minimum on the manifold . In other words, we derive optimality conditions by considering the value of the objective function along curves on the manifold S passing through the optimal point.
The following Theorem shows that the tangent space at a regular (local) minimum point is orthogonal to the gradient of the objective function at . This important fact is illustrated in Fig. ?? in the case of a single equality constraint.
Theorem 2.12: Geometric Necessary Condition for a Local Minimum
Let and be continuously differentiable functions on . Suppose that is a local minimum point of the problem to minimize subject to the constraints . Then, is orthogonal to the tangent space , that is:
Proof. By contradiction, assume that there exists a such that . Let be any smooth curve passing through with and . Let also be the function defined as . Since is a local minimum of f on , by Definition 2.11, we have
It follows that is an unconstrained (local) minimum point for , and
We thus get a contradiction with the fact that □
Next, we take advantage of the foregoing geometric characterization, and derive first-order necessary conditions for equality constrained NLP problems.
Theorem 2.13: First-Order Necessary Conditions
Let and be continuously differentiable functions on . Consider the problem to minimize subject to the constraints . If is a local minimum and is a regular point of the constraints, then there exists a unique vector such that:
Proof. Since is a local minimum of on , by Theorem 2.12, we have , i.e. the system
is inconsistent. Consider the following two sets:
Clearly, and are convex and . Then, by the separation Theorem 2.14, there exists a nonzero vector such that
Letting , and since can be made an arbitrarily large negative number, it follows that (note that if , the lower bound on the left-hand side would be positive and arbitrarily large). Also, letting , we must have , for each . In particular, letting , it follows that , and thus,
Finally, note that , for otherwise the above equation would contradict the assumption of linear independence of . The result follows by letting , and noting that the linear independence assumption implies the uniqueness of these Lagrangian multipliers. □
Theorem 2.14: Separation of Two Convex Sets
Let and be two nonempty, convex sets in and suppose that . Then, there exists a hyperplane that separates and ; that is, there exists a nonzero vector such that
Remark 2.22: Obtaining Candidate Solution Points
The first-order necessary conditions
together with the constraints
give a total of (typically nonlinear) equations in the variables . Hence, these conditions are complete in the sense that they determine, at least locally, a unique solution. However, as in the unconstrained case, a solution to the first-order necessary conditions need not be a (local) minimum of the original problem; it could very well correspond to a (local) maximum point, or some kind of saddle point. These considerations are illustrated in Example 2.15 below.
Remark 2.23: Regularity-Type Assumption
It is important to note that, for a local minimum to satisfy the foregoing first-order conditions and, in particular, for a unique Lagrange multiplier vector to exist, it is necessary that the equality constraints satisfy a regularity condition. In other words, the first-order conditions may not hold at a local minimum point that is non-regular. An illustration of these considerations is provided in Example 2.16.
There are a number of similarities with the constraint qualification needed for a local minimizer of an inequality constrained NLP problem to be a KKT point; in particular, the condition that the minimum point be a regular point for the constraints corresponds to LICQ (see Remark 2.21).
Remark 2.24: Lagrangian
It is convenient to introduce the Lagrangian associated with the constrained problem, by adjoining the cost and constraint functions as:
Thus, if is a local minimum which is regular, the first-order necessary conditions are written as
the latter equations being simply a restatement of the constraints. Note that the solution of the original problem typically corresponds to a saddle point of the Lagrangian function.
Example 2.15 (Regular Case). Consider the problem
Observe first that every feasible point is a regular point for the equality constraint (the point (0,0) being infeasible). Therefore, every local minimum is a stationary point of the Lagrangian function by Theorem 2.13 .
The gradient vectors and are given by
so that the first-order necessary conditions read
These three equations can be solved for the three unknowns , and . Two candidate local minimum points are obtained: (i) , and (ii) . These results are illustrated on Fig. 2.8 . It can be seen that only the former actually corresponds to a local minimum point, while the latter gives a local maximum point.
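Since the problem data of Example 2.15 is not reproduced above, the procedure can be sketched on a hypothetical problem of the same flavor: minimize x1 + x2 subject to x1² + x2² = 2, solving the stationarity conditions of the Lagrangian symbolically (assumes `sympy` is available):

```python
import sympy as sp

x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = x1 + x2                  # hypothetical objective
h = x1**2 + x2**2 - 2        # hypothetical equality constraint
L = f + lam * h              # Lagrangian

# first-order necessary conditions: grad_x L = 0 together with the constraint h = 0
eqs = [sp.diff(L, x1), sp.diff(L, x2), h]
candidates = sp.solve(eqs, [x1, x2, lam], dict=True)
# two candidates: (-1, -1) with lam = 1/2, and (1, 1) with lam = -1/2
```

As in the example, only one candidate, (−1, −1), is a local minimum; the other, (1, 1), is a local maximum, which is confirmed by checking the sign of the projected Hessian along tangent directions.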
Example 2.16 (Non-Regular Case). Consider the problem
As shown by Fig. 1.13., this problem has only one feasible point, namely, ; that is, is also the unique global minimum of the problem. However, at this point, we have
hence the first-order conditions
cannot be satisfied. This illustrates the fact that a minimum point may not be a stationary point for the Lagrangian if that point is non-regular.
The following theorem provides second-order necessary conditions for a point to be a local minimum of a NLP problem with equality constraints.
Theorem 2.15: Second-Order Necessary Conditions
Let and be twice continuously differentiable functions on . Consider the problem to minimize subject to the constraints . If is a local minimum and is a regular point of the constraints, then there exists a unique vector such that
and
Proof. Note first that directly follows from Theorem 2.13. Let be an arbitrary direction in ; that is, since is a regular point (see Lemma 2.3). Consider any twice differentiable curve , passing through with and . Let be the function defined as . Since is a local minimum of on , is an unconstrained (local) minimum point for . By Theorem 2.4, it follows that
Furthermore, differentiating the relation twice, we obtain
Adding the last two equations yields
and this condition must hold for every such that □
It is useful to shed more light on the derivation of the previous proof. Thus, we carry out the derivative of explicitly. Note that since the curve is in we have . Hence is identically zero for all (i.e. ). We can expand as . The first total derivative is:
We can now differentiate again with respect to time to derive the second total derivative:
Remember that, at , and , and that the first-order necessary conditions hold. Therefore, the second derivative of at gives:
Hence:
Using the last equation, it is easier to follow the proof of Theorem 2.15
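With the stripped symbols restored (a standard reconstruction: x* the local minimum, λ* the multiplier vector, x(t) a curve on S with x(0) = x*), the computation just sketched reads:

```latex
% second derivative of the objective along the curve
\varphi(t) = f(x(t)), \qquad
\ddot{\varphi}(0) = \dot{x}(0)^{\top}\nabla^{2}f(x^{*})\,\dot{x}(0)
                  + \nabla f(x^{*})^{\top}\ddot{x}(0) \;\ge\; 0 .

% differentiating $\lambda^{*\top} h(x(t)) \equiv 0$ twice:
0 = \dot{x}(0)^{\top}\Bigl(\sum_{i=1}^{m}\lambda_{i}^{*}\,
      \nabla^{2}h_{i}(x^{*})\Bigr)\dot{x}(0)
  + \lambda^{*\top}\nabla h(x^{*})\,\ddot{x}(0) .

% adding the two relations, the stationarity condition
% $\nabla f(x^{*}) + \nabla h(x^{*})^{\top}\lambda^{*} = 0$
% cancels the terms in $\ddot{x}(0)$, leaving
\dot{x}(0)^{\top}\,\nabla^{2}_{xx}L(x^{*},\lambda^{*})\,\dot{x}(0) \;\ge\; 0 .
```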
Remark 2.25: Eigenvalues in Tangent Space
In the foregoing Theorem, it is shown that the matrix restricted to the subspace plays a key role. Geometrically, the restriction of to corresponds to the projection .
A vector is said to be an eigenvector of if there is a real number such that:
the corresponding is said to be an eigenvalue of (these definitions coincide with the usual definitions of eigenvector and eigenvalue for real matrices).
Now, to obtain a matrix representation for , it is necessary to introduce a basis of the subspace , say . Note that . Then, the eigenvalues of are the same as those of the matrix ; in particular, they are independent of the particular choice of basis .
Example 2.17 (Regular Case Continued). Consider again the Example 2.15 . Two candidate local minimum points, (i) and (ii) , were obtained on application of the first-order necessary conditions. The Hessian matrix of the Lagrangian function is given by
and a basis of the tangent subspace at a point is
Therefore,
Since all the feasible and must satisfy the constraint we have:
In particular, for the candidate solution point (i), we have
hence satisfying the second-order necessary conditions (in fact, this point also satisfies the second-order sufficient conditions of optimality discussed hereafter). On the other hand, for the candidate solution point (ii), we get
which does not satisfy the second-order necessary requirement, so this point cannot be a local minimum.
The conditions given in Theorems 2.13 and 2.15 are necessary conditions that must hold at each local minimum point. Yet, a point satisfying these conditions may not be a local minimum. The following theorem provides sufficient conditions for a stationary point of the Lagrangian function to be a (local) minimum, provided that the Hessian matrix of the Lagrangian function is locally convex along directions in the tangent space of the constraints.
Theorem 2.16: Second-Order Sufficient Conditions
Let and be twice continuously differentiable functions on . Consider the problem to minimize subject to the constraints . If and satisfy
and
where , then is a strict local minimum.
Proof. Consider the augmented Lagrangian function
where is a scalar. We have
where . Since satisfy the sufficient conditions and by Lemma 2.4, we obtain
for sufficiently large . Being positive definite at ,
Finally, since when , we get
i.e., is a strict local minimum. □
Example 2.18. Consider the problem
The first-order conditions for this problem are
together with the equality constraint. It is easily checked that the point , satisfies these conditions. Moreover,
and a basis of the tangent space to the constraint at is
We thus obtain
which is clearly a positive definite matrix. Hence, is a strict local minimum of (1.18). (Interestingly enough, the Hessian matrix of the objective function itself is indefinite at in this case.)
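Since the data of the example is not reproduced above, the projected-Hessian check can be sketched on a hypothetical problem with the same feature (the objective Hessian is indefinite, yet the restriction to the tangent space is positive definite): minimize −x1·x2 subject to x1 + x2 = 2, whose KKT point is x* = (1, 1) with λ* = 1.

```python
import numpy as np

# Hessian of the Lagrangian L = -x1*x2 + lam*(x1 + x2 - 2) at x* = (1, 1):
H = np.array([[0.0, -1.0],
              [-1.0, 0.0]])     # indefinite as a full matrix (eigenvalues -1, +1)

# grad h = (1, 1), so a basis of the tangent space ker(grad h) is:
E = np.array([[1.0],
              [-1.0]])

H_proj = E.T @ H @ E            # restriction of H to the tangent space
# H_proj = [[2.0]] > 0, so x* is a strict local minimum by Theorem 2.16
```

This mirrors the closing remark of the example: positive definiteness is only required on the tangent space, not on the whole of R^n.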
We close this section by providing insight into the Lagrange multipliers.
Remark 2.26: A first Interpretation of the Lagrange Multipliers
The concept of Lagrange multipliers allows one to adjoin the constraints to the objective function. That is, one can view constrained optimization as a search for a vector at which the gradient of the objective function is a linear combination of the gradients of constraints.
Another insightful interpretation of the Lagrange multipliers is as follows. Consider the set of perturbed problems . Suppose that there is a unique regular solution point for each , and let denote the evolution of the optimal solution point as a function of the perturbation parameter . Clearly,
Moreover, since for each , we have:
Denoting by the Lagrange multiplier associated to the constraint in the original problem, we have
Therefore, the Lagrange multipliers can be interpreted as the sensitivity of the objective function with respect to the constraint . Said differently, indicates how much the optimal cost would change if the constraints were perturbed. This interpretation extends straightforwardly to NLP problems having inequality constraints. The Lagrange multiplier of an active constraint, say , can be interpreted as the sensitivity of with respect to a change in that constraint, as ; in this case, the positivity of the Lagrange multipliers follows from the fact that, by increasing , the feasible region is relaxed, hence the optimal cost cannot increase. Regarding inactive constraints, the sensitivity interpretation also explains why their Lagrange multipliers are zero, as any infinitesimal change in the value of these constraints leaves the optimal cost value unchanged.
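The sensitivity interpretation can be verified numerically. The sketch below uses a hypothetical family of perturbed problems (not one of the text's examples): minimize x² + y² subject to x + y = c, whose KKT system is linear and can be solved exactly for each c.

```python
import numpy as np

def solve_perturbed(c):
    """Hypothetical perturbed problem: min x^2 + y^2  s.t.  x + y = c.
    With the L = f + lam*h convention, the KKT system is linear:
        2x + lam = 0,   2y + lam = 0,   x + y = c."""
    K = np.array([[2.0, 0.0, 1.0],
                  [0.0, 2.0, 1.0],
                  [1.0, 1.0, 0.0]])
    x, y, lam = np.linalg.solve(K, np.array([0.0, 0.0, c]))
    return x**2 + y**2, lam

eps = 1e-6
f_plus, _ = solve_perturbed(1.0 + eps)
f_minus, _ = solve_perturbed(1.0 - eps)
_, lam = solve_perturbed(1.0)
sensitivity = (f_plus - f_minus) / (2 * eps)   # numerical d f*/dc
# with this sign convention, d f*/dc = -lam
```

The sign relating the multiplier to the sensitivity depends on the convention used when adjoining the constraint; here, with L = f + λh, the finite-difference sensitivity equals −λ.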
Remark 2.27: A second Interpretation of the Lagrange Multipliers: Mechanical Analogy
It is possible to introduce an interesting physical interpretation of the KKT conditions and their Lagrange multipliers. Consider a ball on a hilly terrain, where the elevation of the terrain is described by the function , being the and coordinates. The gravitational potential is ; we use normalized units so that . In the unconstrained case, the equilibrium point is a stationary point of the potential, that is . The generalized force due to gravity acting on the ball at a generic point is . Note that the motion along the vertical axis is completely determined by (i.e. the system has two degrees of freedom). An equality constraint can be seen as a rail on which the ball is forced to slide. The rail is modeled as a curve of the form . Such a constraint exerts a generalized reaction force at each point normal to the curve, hence in the direction of . The magnitude of this force depends on how much the gravitational force is pushing against the constraint. The static equilibrium condition is . On the other hand, inequality constraints can be seen as barriers, and the ball is forced to lie inside these barriers. Consider the inequality constraint . When the constraint is inactive, it exerts no force on the ball and hence does not affect the equilibrium equation. When the constraint is active, the direction of the force exerted is . Compared to rails, barriers are unilateral constraints and can only exert forces in one sense. This is the physical interpretation of the nonnegativity of the associated Lagrange multipliers . The equilibrium equation in the presence of both equality and inequality constraints becomes the first KKT condition. This interpretation is illustrated in Figure 2.9. Gravitational and reaction forces are , , . The force balance is therefore the KKT stationarity condition with reversed sign:
In this section, we shall consider general, nonlinear programming problems with both equality and inequality constraints,
Before stating necessary and sufficient conditions for such problems, we shall start by revisiting the KKT conditions for inequality constrained problems, based on the method of Lagrange multipliers described in 2.7.
Consider the problem to minimize a function and suppose that is a local minimum point. Clearly, is also a local minimum of the inequality constrained problem where the inactive constraints , have been discarded. Thus, in effect, inactive constraints at don’t matter; they can be ignored in the statement of optimality conditions.
On the other hand, active inequality constraints can be treated to a large extent as equality constraints at a local minimum point. In particular, it should be clear that is also a local minimum to the equality constrained problem
That is, it follows from Theorem 2.13 that, if is a regular point, there exists a unique Lagrange multiplier vector such that
Assigning zero Lagrange multipliers to the inactive constraints, we obtain
This latter condition can be rewritten by means of the following equations:
The argument showing that is a little more elaborate. By contradiction, assume that for some . Let be the matrix whose rows are and . Since is a regular point, the Lagrange multiplier vector is unique. Therefore, the condition
can only be satisfied by with . Because , we know by Theorem 2.1 that there exists a direction such that . In other words,
which contradicts the hypothesis that is a local minimizer of on (see Remark 2.18).
Overall, these results thus constitute the KKT optimality conditions as stated in Theorem 2.10. But although the foregoing development is straightforward, it is somewhat limited by the regularity-type assumption at the optimal solution. Obtaining more general constraint qualifications (see Remark 2.21) requires that the KKT conditions be derived using an alternative approach, e.g., the approach described earlier in Section 2.6. Still, the conversion to an equality constrained problem proves useful in many situations, e.g., for deriving second-order sufficiency conditions for inequality constrained NLP problems.
We are now ready to generalize the necessary and sufficient conditions given in Theorems 2.10, 2.13, 2.15 and 2.16 to general NLP problems.
Theorem 2.17: First- and Second-Order Necessary Conditions
Let , and , be twice continuously differentiable functions on . Consider the problem of minimizing subject to the constraints and . If is a local minimum of the optimization problem and is a regular point of the constraints, then there exist unique vectors and such that
and
for all such that and
Theorem 2.18: Second-Order Sufficient Conditions
Let , and , be twice continuously differentiable functions on . Consider the problem of minimizing subject to the constraints and . If there exists , and satisfying the KKT conditions in Theorem 2.17, and
for all such that
where , then is a strict local minimum.
Likewise, the KKT sufficient conditions given in Theorem 2.11 for convex, inequality constrained problems can be generalized to general convex problems as follows:
Theorem 2.19: KKT Sufficient Conditions for Convex Programs
Let , be convex and differentiable functions. Let also be affine functions. Consider the problem to minimize subject to . If satisfies the KKT conditions of Theorem 2.17 then is a global minimizer for on .
Theorem 2.20: Farkas’ Theorem
Let and . Then, exactly one of the following two statements holds:
System 1.
,
System 2.
.
Proof. See, e.g., [6, Theorem 2.4.5] for a proof. □
Farkas’ Theorem is used extensively in the derivation of optimality conditions of (linear and) nonlinear programming problems. A geometrical interpretation of Farkas’ Theorem is shown in Fig. 1.A.1. If a1, . . . , am denote the rows of A, then System 2 has a solution if c lies in the convex cone generated by a1, . . . , am; on the other hand, System 1 has a solution if the closed convex cone {x : Ax ≤ 0} and the open half-space {x : c^T x > 0} have a nonempty intersection.
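Which alternative holds can also be checked computationally, since feasibility of System 2 is a linear program. A sketch with hypothetical data (assumes `scipy` is available):

```python
import numpy as np
from scipy.optimize import linprog

def farkas_alternative(A, c):
    """Return 2 if System 2 (A^T y = c, y >= 0) is feasible, otherwise 1
    (by Farkas' Theorem, System 1 then has a solution)."""
    m = A.shape[0]
    res = linprog(np.zeros(m), A_eq=A.T, b_eq=c, bounds=[(0, None)] * m)
    return 2 if res.success else 1

A = np.array([[1.0, 0.0],
              [0.0, 1.0]])
s2 = farkas_alternative(A, np.array([1.0, 1.0]))    # c lies in cone(a1, a2)
s1 = farkas_alternative(A, np.array([-1.0, 0.0]))   # e.g. x = (-1, 0) solves System 1
```

In the first call c is a nonnegative combination of the rows of A, so System 2 holds; in the second it is not, and a separating direction x with Ax ≤ 0 and c^T x > 0 exists instead.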
Corollary 2.1: Gordan’s Theorem
Let . Then, exactly one of the following two statements holds:
System 1.
,
System 2.
.
Proof. System 1 can be equivalently written as where is a scalar and is a vector of ones. Rewriting this in the form of System 1 in Farkas’ Theorem 2.20, we get and where . The associated System 2 by Theorem 2.20 states that and for some , i.e., , and , which is equivalent to the System 2 of the corollary. □
Lemma 2.4
Let and be two symmetric matrices, such that and on the null space of (i.e., ). Then,
Proof. Assume the contrary. Then,
Consider a subsequence converging to some with . Dividing the last equation by , and taking the limit as , we obtain
On the other hand, being positive semidefinite, we must have
Hence . That is, using the hypothesis, . This contradicts the fact that
Nowadays, strong and efficient mathematical programming techniques are available for solving a great variety of nonlinear problems, which are based on solid theoretical results and extensive numerical studies. Approximate functions, derivatives and optimal solutions can be employed together with optimization algorithms to reduce the computational time. The aim of this section is not to describe state-of-the-art algorithms in nonlinear programming, but to explain, in a simple way, a number of modern algorithms for solving nonlinear problems. These techniques are typically iterative in the sense that, given an initial point , a sequence of points, , is obtained by repeated application of an algorithmic rule. The objective is to make this sequence converge to a point in a pre-specified solution set. Typically, the solution set is specified in terms of optimality conditions, such as those presented in 2.5 through 2.8.
We start by recalling a number of concepts in 2.9.1. Then, we discuss the principles of Newton-like algorithms for nonlinear systems in 2.9.2, and use these concepts for the solution of unconstrained optimization problems in 2.9.3. Finally, algorithms for solving general nonlinear problems are presented in 2.9.7, with emphasis on sequential unconstrained minimization (SUM) and sequential quadratic programming (SQP) techniques.
Two essential questions must be addressed concerning iterative algorithms. The first question, which is qualitative in nature, is whether a given algorithm in some sense yields, at least in the limit, a solution to the original problem; the second question, the more quantitative one, is concerned with how fast the algorithm converges to a solution. We elaborate on these concepts in this subsection.
The convergence of an algorithm is said to be asymptotic when the solution is not achieved after a finite number of iterations; except for particular problems such as linear and quadratic programming, this is generally the case in nonlinear programming. That is, a very desirable property of an optimization algorithm is global convergence:
Definition 2.23: Global Convergence, Local Convergence
An algorithm is said to be globally convergent if, for any initial point , it generates a sequence of points that converges to a point in the solution set. It is said to be locally convergent if there exists a such that for any initial point such that it generates a sequence of points converging to in the solution set.
Most modern mathematical programming algorithms are globally convergent. Locally convergent algorithms are of limited use in practice, since the neighborhood of convergence is not known in advance and can be arbitrarily small.
Next, what distinguishes optimization algorithms with the global convergence property is the order of convergence:
Definition 2.24: Order of Convergence
The order of convergence of a sequence is the largest nonnegative integer p such that
When and the convergence ratio , the convergence is said to be linear; if , the convergence is said to be superlinear. When , the convergence is said to be quadratic.
Since they involve the limit when , and are a measure of the asymptotic rate of convergence, i.e., as the iterates get close to the solution; yet, a sequence with a good order of convergence may be very ”slow” far from the solution. Clearly, the convergence is faster when is larger and is smaller. Near the solution, if the convergence rate is linear, then the error is multiplied by at each iteration. The error reduction is squared for quadratic convergence, i.e., each iteration roughly doubles the number of significant digits. The methods that will be studied hereafter have convergence rates varying between linear and quadratic.
Example 2.19. Consider the problem to minimize , subject to .
Let the (point-to-point) algorithmic map be defined as . It is easily verified that the sequence obtained by applying the map , with any starting point, converges to the optimal solution , i.e., is globally convergent. For example, with , the algorithm generates the sequence . We have , so that the limit in Definition 2.24 is with ; moreover, for , this limit is infinity. Consequently, linearly with convergence ratio .
On the other hand, consider the (point-to-point) algorithmic map defined as . Again, the sequence obtained by applying converges to , from any starting point. However, we now have , which approaches as . Hence, superlinearly in this case. With , the algorithm generates the sequence . The algorithmic maps and are illustrated on the left and right plots in Fig 1.14., respectively.
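Since the maps of Example 2.19 are not reproduced above, the same contrast can be illustrated with two hypothetical maps converging to 0: x ↦ x/2 converges linearly with ratio 1/2, while x ↦ x² converges superlinearly (in fact quadratically):

```python
# error ratios |e_{k+1}| / |e_k| for two maps with fixed point 0
x_lin, x_quad = 0.5, 0.5
ratios_lin, ratios_quad = [], []
for _ in range(5):
    ratios_lin.append((x_lin / 2) / x_lin)     # stays at 1/2: linear convergence
    ratios_quad.append(x_quad**2 / x_quad)     # tends to 0: superlinear convergence
    x_lin, x_quad = x_lin / 2, x_quad**2
```

The linear map shrinks the error by a constant factor per step, while for the quadratic map the ratio itself shrinks at every step, which is the signature of superlinear convergence.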
It should also be noted that for most algorithms, the user must set initial values for certain parameters, such as the starting point and the initial step size, as well as parameters for terminating the algorithm. Optimization procedures are often quite sensitive to these parameters, and may produce different results, or even stop prematurely, depending on their values. Therefore, it is crucial for the user to understand the principles of the algorithms used, so that he or she can select adequate values for the parameters and diagnose the reasons for a premature termination (failure).
The fundamental approach to most iterative schemes was suggested over 300 years ago by Newton. In fact, Newton’s method is the basis for nearly all the algorithms that are described herein.
Suppose one wants to find the value of the variable such that
where is continuously differentiable. Let us assume that one such solution exists, and denote it by . Let also be a guess for the solution. The basic idea of Newton’s method is to approximate the nonlinear function by the first two terms in its Taylor series expansion about the current point . This yields a linear approximation for the vector function at the new point ,
Using this linear approximation, and provided that the Jacobian matrix is nonsingular, a new estimate for the solution can be computed by solving the previous equation for
Letting , we get the update .
Of course, being a nonlinear function, one cannot expect that , but there is much hope that be a better estimate for the root than the original guess . In other words, we might expect that
If the new point is an improvement, then it makes sense to repeat the process, thereby defining a sequence of points . An algorithm implementing Newton’s method is as follows:
Algorithm 2.1
Initialization Step:
Let be a termination scalar, and choose an initial point .
Let and go to the main step.
Main Step:
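A minimal Python sketch of the procedure above (the test system at the bottom is hypothetical, not one of the text's examples; `numpy` is assumed):

```python
import numpy as np

def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton's method for F(x) = 0: repeatedly solve J(x) dx = -F(x), x <- x + dx."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < tol:        # termination criterion
            break
        x = x - np.linalg.solve(J(x), Fx)   # Newton correction
    return x

# hypothetical test system: x1^2 + x2^2 = 2 and x1 = x2, with a root at (1, 1)
F = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])
J = lambda x: np.array([[2 * x[0], 2 * x[1]], [1.0, -1.0]])
root = newton(F, J, [2.0, 0.5])
```

Note that the linear system is solved at each iteration rather than forming the inverse Jacobian explicitly, which is both cheaper and numerically safer.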
It can be shown that the rate of convergence for Newton’s method is quadratic (see Definition 2.24). Loosely speaking, it implies that each successive estimate of the solution doubles the number of significant digits, which is a very desirable property for an algorithm to possess.
Theorem 2.21: Quadratic convergence for Newton’s algorithm
Let be continuously differentiable, and consider Newton’s algorithm defined by the map . Let be such that , and suppose that is nonsingular. Let the starting point be sufficiently close to , so that there exist with , and
for each satisfying . Then, Newton’s algorithm converges with a quadratic rate of convergence.
Proof. See [6, Theorem 8.6.5] for a proof. □
But can anything go wrong with Newton’s method?
Lack of Global Convergence First and foremost, if the initial guess is not sufficiently close to the solution, i.e., within the region of convergence, Newton’s method may diverge. Said differently, Newton’s method as presented above does not have the global convergence property (see Definition 2.23 and Example 2.20 hereafter). This is because may not be a valid descent direction far from the solution, and even if , a unit step size might not give a descent in . Globalization strategies, which aim at correcting this latter deficiency, will be presented in 2.9.4 in the context of unconstrained optimization.
Singular Jacobian Matrix A second difficulty occurs when the Jacobian matrix becomes singular during the iteration process, since the correction defined by the Newton’s algorithm is not defined in this case. Note that the assumption in Theorem 2.21 guarantees that cannot be singular. But when the Jacobian matrix is singular at the solution point , then Newton’s method loses its quadratic convergence property.
Computational Efficiency Finally, at each iteration, Newton’s method requires (i) that the Jacobian matrix be computed, which may be difficult and/or costly especially when the expression of is complicated, and (ii) that a linear system be solved. The analytic Jacobian can be replaced by a finite-difference approximation, yet this is costly as additional evaluations of are required at each iteration. With the objective of reducing the computational effort, quasi-Newton methods generate an approximation of the Jacobian matrix, based on the information gathered from the iteration progress. To avoid solving a linear system for the search direction, variants of quasi-Newton methods also exist that generate an approximation of the inverse of the Jacobian matrix. Such methods will be described in 2.9.5 in the context of unconstrained optimization.
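Quasi-Newton ideas of this kind can be sketched with Broyden's rank-one update (one classical choice, not necessarily the variant the text has in mind): the Jacobian approximation B is corrected at each step so that the secant condition B_new Δx = ΔF holds, avoiding any derivative evaluations.

```python
import numpy as np

def broyden(F, x0, B0, tol=1e-10, max_iter=100):
    """Broyden's method: quasi-Newton iteration with rank-one Jacobian updates."""
    x, B = np.asarray(x0, float), np.asarray(B0, float)
    Fx = F(x)
    for _ in range(max_iter):
        if np.linalg.norm(Fx) < tol:
            break
        dx = -np.linalg.solve(B, Fx)
        F_new = F(x + dx)
        # rank-one correction enforcing the secant condition B_new @ dx = F_new - Fx
        B = B + np.outer(F_new - Fx - B @ dx, dx) / (dx @ dx)
        x, Fx = x + dx, F_new
    return x

# hypothetical system x1^2 + x2^2 = 2, x1 = x2 (root (1, 1));
# B0 is the exact Jacobian at the starting point, updated by secants thereafter
F = lambda x: np.array([x[0]**2 + x[1]**2 - 2.0, x[0] - x[1]])
x = broyden(F, [1.5, 1.2], np.array([[3.0, 2.4], [1.0, -1.0]]))
```

Only one evaluation of F per iteration is needed after the start, at the price of reducing the quadratic rate of Newton's method to a superlinear one.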
Example 2.20. Consider the problem to find a solution to the nonlinear equation
The Newton iteration sequence obtained by starting from is as follows:
Notice the very fast convergence to the solution , as could be expected from Theorem 2.21 . The first three iterations are represented in Fig. 1.15., on the left plot.
However, convergence is not guaranteed for any initial guess. More precisely, it can be shown that Newton’s method actually diverges when the initial guess is chosen such that , with being a solution of ; further, the method cycles indefinitely for . Both these situations are illustrated in the right plot and the bottom plot of Fig. 1.15., respectively.
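The equation of Example 2.20 is not reproduced above, but f(x) = arctan(x) is a classic instance with exactly this behavior: Newton's method converges rapidly for small |x0|, cycles for a critical |x0| ≈ 1.3917, and diverges beyond it.

```python
import numpy as np

def newton_scalar(f, df, x0, n_iter):
    x = x0
    for _ in range(n_iter):
        x = x - f(x) / df(x)    # scalar Newton step
    return x

f = np.arctan
df = lambda x: 1.0 / (1.0 + x**2)

x_good = newton_scalar(f, df, 0.5, 10)   # |x0| small: rapid convergence to the root 0
x_bad = newton_scalar(f, df, 2.0, 6)     # |x0| too large: the iterates grow without bound
```

The divergence mechanism is visible in the update: far from the root, arctan saturates near ±π/2 while 1 + x² grows, so the step overshoots and each iterate roughly squares in magnitude.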
We now turn to a description of basic techniques used for iteratively solving unconstrained problems of the form:
Many unconstrained optimization algorithms work along the same lines. Starting from an initial point, a direction of movement is determined according to a fixed rule, and then a move is made in that direction so that the objective function value is reduced; at the new point, a new direction is determined and the process is repeated. The main difference between these algorithms rests with the rule by which successive directions of movement are selected. A distinction is usually made between those algorithms which determine the search direction without using gradient information (gradient-free methods), and those using gradient (and higher-order derivatives) information (gradient-based methods). Here, we shall focus our attention on the latter class of methods, and more specifically on Newton-like algorithms.
Throughout this subsection, we shall assume that the objective function is twice continuously differentiable. By Theorem 2.3, a necessary condition for to be a local minimum of is . Hence, the idea is to devise an iterative scheme that finds a point satisfying the foregoing condition. Following the techniques discussed earlier in 2.9.2, this can be done by using a Newton-like algorithm, with corresponding to the gradient of , and to its Hessian matrix .
At each iteration, a new iterate is obtained such that the linear approximation to the gradient at that point is zero,
The linear approximation yields the Newton search direction:
As discussed in 2.9.2, if it converges, Newton’s method exhibits a quadratic rate of convergence when the Hessian matrix is nonsingular at the solution point. However, since the Newton iteration is based on finding a zero of the gradient vector, there is no guarantee that the step will move towards a local minimum, rather than a saddle point or even a maximum. To preclude this, the Newton steps should be taken downhill, i.e., the following descent condition should be satisfied at each iteration,
Interestingly enough, with the Newton direction iteration, the descent condition becomes
That is, a sufficient condition to obtain a descent direction at is that the Hessian matrix be positive definite. Moreover, if is positive definite at a local minimizer of , then the Newton iteration converges to when started sufficiently close to . (Recall that, by Theorem 2.8, positive definiteness of is a sufficient condition for a local minimum of to be a strict local minimum.)
We now discuss two important improvements to Newton’s method, which are directly related to the issues discussed in Subsection 2.9.2, namely (i) the lack of global convergence, and (ii) computational efficiency.
Up to this point, the development has focused on the application of Newton’s method. However, even in the simplest one-dimensional applications, Newton’s method has deficiencies (see, e.g., Example 2.20). Methods for correcting global convergence deficiencies are referred to as globalization strategies. It should be stressed that an efficient globalization strategy should only alter the iterates when a problem is encountered, but it should not impede the ultimate behavior of the method, i.e., the quadratic convergence of Newton’s method should be retained.
In unconstrained optimization, one can detect problems in a very simple fashion, by monitoring whether the next iterate satisfies a descent condition with respect to the current iterate , e.g., . Then, either one of two globalization strategies can be used to correct the Newton step. The first strategy, known as the line search method, is to alter the magnitude of the step; the second one, known as the trust region method, is to modify both the step magnitude and direction. We shall concentrate only on the former class of globalization strategies subsequently.
A line search method proceeds by replacing the full Newton step with
where the step-length is chosen such that the objective function is reduced,
Clearly, the optimal choice for would be . The resulting minimization problem can be solved by any one-dimensional exact minimization technique (e.g., Newton’s method itself). However, such techniques are costly in the sense that they often require many iterations to converge and, therefore, many function (or even gradient) evaluations.
In response to this, most modern algorithms implement so-called inexact line search criteria, which aim to find a step-length giving an "acceptable" decrease in the objective function. Note that, by sacrificing accuracy, we might impair the convergence of the overall algorithm that iteratively employs such a line search. However, by adopting a line search that guarantees a sufficient degree of descent in the objective function, the convergence of the overall algorithm can still be established.
We now describe one popular definition of an acceptable step-length known as Armijo’s rule. Other popular approaches are the quadratic and cubic fit techniques, as well as Wolfe’s and Goldstein’s tests. Armijo’s rule is driven by two parameters and , which respectively prevent the acceptable step-length from being too large or too small (typical values are and ). Define the line search function , for , and consider the modified first-order approximation . A step-length is deemed acceptable if the following conditions hold:
The first condition prevents the step-length from being too large, whereas the second condition prevents it from being too small. The acceptable region defined by Armijo’s rule is shown in Fig. 1.16 below.
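A backtracking implementation in the spirit of Armijo's rule might look as follows. This is a sketch under stated assumptions: the parameter names eta and beta stand in for the two parameters mentioned above, the sufficient-decrease test uses the gradient-based first-order model, and backtracking from a full step serves to keep the step from being needlessly small:

```python
import numpy as np

def armijo_step(f, x, d, g, alpha0=1.0, eta=0.2, beta=0.5, max_halvings=50):
    """Backtracking line search based on Armijo's sufficient-decrease rule.

    Starting from the trial step alpha0, the step t is reduced by the
    factor beta until  f(x + t*d) <= f(x) + eta * t * g.dot(d),  where g
    is the gradient of f at x and d a descent direction (g.dot(d) < 0).
    """
    fx, slope = f(x), g.dot(d)
    t = alpha0
    for _ in range(max_halvings):
        if f(x + t * d) <= fx + eta * t * slope:
            break
        t *= beta
    return t

# Illustrative test: f(x) = ||x||^2, steepest-descent direction from x = 1
f = lambda x: float(x.dot(x))
x = np.array([1.0]); g = np.array([2.0]); d = -g
t = armijo_step(f, x, d, g)
```

On this quadratic the full step overshoots and is rejected, while a single halving lands exactly on the minimizer, illustrating how the rule accepts a large but not excessive step.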
Another limitation of Newton’s method when applied to unconstrained optimization problems is that the Hessian matrix of the objective function is needed at each iteration; a linear system must then be solved to obtain the search direction. For many applications, this can be a costly computational burden. In response to this, quasi-Newton methods attempt to construct this information recursively. However, by so doing, the quadratic rate of convergence is lost.
The basic idea for many quasi-Newton methods is that two successive iterates together with the corresponding gradients , yield curvature information by means of the first-order approximation relation
with . In particular, given linearly independent iteration increments an approximation of the Hessian matrix can be obtained as
or for the inverse Hessian matrix as
where
Note that when the objective function is quadratic, the previous relations are exact. Many interesting quasi-Newton methods use similar, though more sophisticated, schemes to construct approximations that progressively approach the inverse Hessian. One of the most popular classes of quasi-Newton methods (known as the Broyden family) proceeds as follows:
where . It is easily seen that when supplemented with a line search strategy, at each , and hence the Hessian matrix approximations are guaranteed to exist. Moreover, it can be shown that the successive approximates remain positive definite provided that is itself positive definite.
By setting , the previous equation yields the Davidon-Fletcher-Powell (DFP) method, which is historically the first quasi-Newton method, while setting gives the Broyden-Fletcher-Goldfarb-Shanno (BFGS) method, for which there is substantial evidence that it is the best general-purpose quasi-Newton method currently known.
A Newton-like algorithm including both a line search method (Armijo’s rule) and Hessian recursive update (DFP update) is as follows:
Algorithm 2.2
Initialization Step:
Let be a termination scalar, and choose an initial point and a symmetric, positive definite matrix . Let , and go to the main step.
Main Step:
with and
The standard unconstrained optimization algorithm in MATLAB (i.e., the function fminunc) is an implementation of the quasi-Newton method, with DFP or BFGS update, and a line search strategy.
Example 2.21. Consider the problem to find a minimum to Rosenbrock’s function
for with . We solved this problem using the function fminunc.
The results are shown in Fig. 1.17. Observe the slow convergence of the iterates far from the optimal solution but the very fast convergence in the vicinity of .
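A quasi-Newton iteration of the kind described above can be sketched as follows and applied to Rosenbrock's function. This sketch is not MATLAB's actual fminunc implementation: it uses the BFGS update of an inverse-Hessian approximation (one member of the Broyden family discussed above) together with an Armijo backtracking line search, and all parameter values are illustrative:

```python
import numpy as np

def bfgs_minimize(f, grad, x0, tol=1e-6, max_iter=500):
    """Quasi-Newton method: BFGS update of an inverse-Hessian
    approximation D, combined with Armijo backtracking (a sketch)."""
    x = np.asarray(x0, dtype=float)
    n = len(x)
    D = np.eye(n)                      # initial inverse-Hessian approximation
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < tol:
            break
        d = -D @ g                     # quasi-Newton search direction
        # Armijo backtracking line search
        t, fx, slope = 1.0, f(x), g @ d
        for _ in range(60):
            if f(x + t * d) <= fx + 1e-4 * t * slope:
                break
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        delta, gamma = x_new - x, g_new - g
        dg = delta @ gamma
        if dg > 1e-12:                 # skip the update to keep D positive definite
            V = np.eye(n) - np.outer(delta, gamma) / dg
            D = V @ D @ V.T + np.outer(delta, delta) / dg
        x, g = x_new, g_new
    return x

# Rosenbrock's function; the minimum lies at (1, 1)
f = lambda x: 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2
grad = lambda x: np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                           200.0 * (x[1] - x[0]**2)])
x_star = bfgs_minimize(f, grad, [-1.2, 1.0])
```

Run from the classical starting point (-1.2, 1), the iterates reproduce the behavior noted in Example 2.21: slow progress along the curved valley at first, then rapid convergence near the solution.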
In this subsection, we turn our attention to algorithms for iteratively solving constrained problems of the form:
Many modern deterministic algorithms for constrained NLP problems are based on the (rather natural) principle that, instead of solving a difficult problem directly, one had better solve a sequence of simpler, but related, subproblems, which converges to a solution of the original problem either in a finite number of steps or in the limit. Working along these lines, two classes of algorithms can be distinguished for the solution of NLP problems with equality and/or inequality constraints. On the one hand, penalty function and interior-point methods consist of solving the problem as a sequence of unconstrained problems (or problems with simple constraints), so that algorithms for unconstrained optimization can be used. These methods, which do not rely on the KKT theory described earlier in 2.6 through 2.8, shall be briefly presented in 2.9.8 and 2.9.9. On the other hand, Newton-like methods solve NLP problems by attempting to find a point satisfying the necessary conditions of optimality (KKT conditions in general). Sequential quadratic programming (SQP), which shall be presented in 2.9.10, represents one such class of methods.
Methods using penalty functions transform a constrained problem into a single unconstrained problem or a sequence of unconstrained problems. This is done by placing the constraints into the objective function via a penalty parameter in a way that penalizes any violation of the constraints. To illustrate it, consider the NLP problem:
where , and , and are defined on .
In general, a suitable penalty function for the minimization problem is defined by:
where and are continuous functions satisfying the conditions:
Typically, and are of the forms
with a positive integer (taking provides continuously differentiable penalty functions). The function is referred to as the auxiliary function.
Example 2.22. Consider the problem to minimize , subject to . It is immediately evident that the optimal solution lies at the point , and has objective value .
Now, consider the penalty problem to minimize in , where is a large number. Note first that for any , the auxiliary function is convex, since it is a sum of convex functions. Thus, a necessary and sufficient condition for optimality is that the gradient of be equal to zero, yielding . Thus, the solution of the penalty problem can be made arbitrarily close to the solution of the original problem by choosing sufficiently large. Moreover, , which can also be made arbitrarily close to by taking sufficiently large. These considerations are illustrated in Fig. 1.18 below.
The conclusions of Example 2.22, namely that the solution of the penalty problem can be made arbitrarily close to the solution of the original problem, and the optimal auxiliary function value arbitrarily close to the optimal objective value, by choosing sufficiently large, are formalized in the following:
Theorem 2.22
Consider a general NLP problem, where , and are continuous functions on and is a nonempty set in . Suppose that the NLP problem has a feasible solution, and let be a continuous penalty function. Suppose further that for each there exists a solution to the problem , and that is contained in a compact subset of . Then,
with . Furthermore, the limit of any convergent subsequence of is an optimal solution to the original problem and as .
Proof. See [6, Theorem 9.2.2] for a proof. □
Note that the assumption that is compact is necessary, for it is possible otherwise that the optimal objective values of the original and penalty problems are not equal. Yet, this assumption is not very restrictive in most practical cases, as the variables usually lie between finite upper and lower bounds. Note also that no restriction is imposed on , and other than continuity. However, the application of an efficient solution procedure for the (unconstrained) auxiliary problems may impose additional restrictions on these functions (see 2.9.3).
Under the conditions that (i) , , and , are continuously differentiable, and (ii) is a regular point, the solution to the penalty problem can be used to recover the Lagrange multipliers associated with the constraints at optimality. In the particular case where , we get
The larger , the better the approximation of the Lagrange multipliers:
Example 2.23. Consider the same problem as in Example 1.71. The auxiliary function being continuously differentiable, the Lagrange multiplier associated with the inequality constraint can be recovered as (assuming ). Note that the exact value of the Lagrange multiplier is obtained for each here, because is a linear constraint.
From a computational viewpoint, superlinear convergence rates might be achievable, in principle, by applying Newton’s method (or its variants such as quasi-Newton methods). Yet, one can expect ill-conditioning problems when is taken very large in the penalty problem. With a large , more emphasis is placed on feasibility, and most procedures for unconstrained optimization will move quickly towards a feasible point. Even though this point may be far from the optimum, both slow convergence and premature termination can occur due to very small step size and finite precision computations (round-off errors).
As a result of the aforementioned difficulties associated with large penalty parameters, most algorithms using penalty functions employ a sequence of increasing penalty parameters. With each new value of the penalty parameter, an optimization technique is employed, starting from the optimal solution corresponding to the previously chosen parameter value. Such an approach is often referred to as the sequential unconstrained minimization (SUM) technique. This way, a sequence of infeasible points is typically generated, whose limit is an optimal solution to the original problem (hence the term exterior penalty function approach).
To conclude our discussion of the penalty function approach, we give an algorithm to solve a general NLP problem, where the penalty functions used are of the form specified by and
Algorithm 2.3
Initialization Step:
Let be a termination scalar, and choose an initial point , a penalty parameter , and a scalar . Let and go to the main step.
Main Step:
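To make the exterior penalty approach concrete, the following sketch applies the SUM scheme to a stand-in problem of our own, minimizing x subject to 2 - x <= 0 (analogous in structure to Example 2.22); each auxiliary problem is solved by a one-dimensional Newton iteration, warm-started from the previous solution:

```python
import numpy as np

def penalty_path(mus, x0=0.0):
    """Exterior penalty method for  min x  s.t.  2 - x <= 0.

    The auxiliary function is  phi(x) = x + mu * max(0, 2 - x)**2.
    For each penalty parameter mu, phi is minimized by a Newton
    iteration warm-started at the previous solution (SUM scheme).
    """
    x, path = x0, []
    for mu in mus:
        for _ in range(50):
            viol = max(0.0, 2.0 - x)          # constraint violation
            dphi = 1.0 - 2.0 * mu * viol      # phi'(x)
            if abs(dphi) < 1e-12:
                break
            x -= dphi / (2.0 * mu)            # Newton step (phi'' = 2*mu here)
        path.append(x)
    return np.array(path)

mus = [1.0, 10.0, 100.0, 1000.0]
path = penalty_path(mus)
# The iterates 1.5, 1.95, 1.995, 1.9995 approach the solution x* = 2 from
# outside the feasible region (an "exterior" path); the multiplier estimate
# 2*mu*(2 - x) equals the exact Lagrange multiplier (here 1) for every mu,
# since the constraint is linear.
```

Note how each minimizer lies at 2 - 1/(2*mu): feasibility improves as mu grows, which is exactly the ill-conditioning trade-off discussed in the surrounding text.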
Similar to penalty functions, barrier functions can also be used to transform a constrained problem into an unconstrained problem (or into a sequence of unconstrained problems). These functions act as a barrier and prevent the iterates from leaving the feasible region. If the optimal solution occurs at the boundary of the feasible domain, the procedure moves from the interior to the boundary of the domain, hence the name interior-point methods. To illustrate these methods, consider the NLP problem:
where is a subset of , and , are continuous on . Note that equality constraints, if any, should be accommodated within the set . (In the case of affine equality constraints, one can possibly eliminate them by solving for some variables in terms of the others, thereby reducing the dimension of the problem.) This treatment is necessary because barrier function methods require the set to be nonempty; this would obviously not be possible if the equality constraints were accommodated within the set of inequalities as and .
A barrier problem formulates as:
where . Ideally, the barrier function should take value zero on the region , and value on its boundary. This would guarantee that the iterates do not leave the domain provided the minimization is started at an interior point. However, this discontinuity poses serious difficulties for any computational procedure. Therefore, this ideal construction of is replaced by the more realistic requirement that be nonnegative and continuous over the region and approach infinity as the boundary is approached from the interior:
where is a continuous function over that satisfies the conditions
In particular, approaches the ideal barrier function described above as approaches zero.
Typical barrier functions are
The following barrier function, known as Frisch’s logarithmic barrier function, is also widely used
Actually, Frisch’s logarithmic barrier function does not satisfy the nonnegativity requirement for . However, the requirement on can be relaxed: it is sufficient that be positive close to .
The function is referred to as the auxiliary function.
Given , evaluating seems no simpler than solving the original problem because of the constraint . However, starting the optimization from a point in the region yields an optimal point in , even when the constraint is ignored. This is because approaches infinity as the iterates approach the boundary of from within , hence preventing them from leaving the set . This is formalized in the following:
Theorem 2.23
Consider an NLP problem with inequality constraints , where and are continuous functions on and is a nonempty closed set in . Suppose that the minimization problem has an optimal solution with the property that, given any neighborhood around , there exists an such that . Suppose further that for each , there exists a solution to the problem . Then,
with . Furthermore, the limit of any convergent subsequence of is an optimal solution to the original problem, and as .
Proof. See [6, Theorem 9.4.3] for a proof. □
Under the conditions that (i) , and are continuously differentiable, and (ii) is a regular point, the solution to the barrier function problem can be used to recover the Lagrange multipliers associated with the constraints at optimality. In the particular case where , we get:
The approximation of the Lagrange multipliers gets better as gets closer to :
Example 2.24. Consider the problem to minimize , subject to , the solution of which lies at the point with objective value . Now, consider the barrier problem to minimize in , where is a large number. Note first that for any , the auxiliary function is convex (indeed, is a concave function on the convex set , therefore is convex, and thus is a sum of convex functions). Thus, a necessary and sufficient condition for optimality is that the gradient of be equal to zero, yielding (assuming ). Thus, the solution of the barrier problem can be made arbitrarily close to the solution of the original problem by choosing sufficiently close to zero. Moreover, , which can also be made arbitrarily close to by taking sufficiently close to zero. These considerations are illustrated in Fig. 1.19 below. Regarding the Lagrange multiplier associated with the inequality constraint, the objective and constraint functions being continuously differentiable, an approximate value can be obtained as . Here again, the exact value of the Lagrange multiplier is obtained for each because is a linear constraint.
The use of barrier functions for solving constrained NLP problems also faces several computational difficulties. First, the search must start with a point such that , and finding such a point may not be an easy task for some problems. Also, because of the structure of the barrier function , and for small values of the parameter , most search techniques may face serious ill-conditioning and difficulties with round-off errors while solving the problem to minimize over , especially as the boundary of the region is approached. Accordingly, interior-point algorithms employ a sequence of decreasing penalty parameters as ; with each new value , an optimal solution to the barrier problem is sought by starting from the previous optimal solution. As in the exterior penalty function approach, it is highly recommended to use suitable second-order Newton or quasi-Newton methods for solving the successive barrier problems.
We describe below a scheme using barrier functions for optimizing an NLP problem.
Algorithm 2.4
Initialization Step:
Let be a termination scalar, and choose an initial point , with . Let , , , and go to the main step.
Main Step:
Note that although the constraint may be ignored, it is considered in the problem formulation as most line search methods use discrete steps, and a step could lead to a point outside the feasible region (where the value of the barrier function is a large negative number), when close to the boundary. Therefore, the problem can effectively be treated as an unconstrained optimization problem only if an explicit check for feasibility is made.
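A matching sketch of the barrier scheme, on the same stand-in problem as in the penalty sketch (min x subject to 2 - x <= 0, our own example), uses Frisch's logarithmic barrier. Note the explicit feasibility check inside the damped Newton step, in line with the remark above that a discrete step could otherwise leave the feasible region:

```python
import numpy as np

def barrier_path(eps_values, x0=3.0):
    """Logarithmic barrier method for  min x  s.t.  2 - x <= 0.

    The auxiliary function is  phi(x) = x - eps * log(x - 2), minimized
    for a decreasing sequence of barrier parameters eps by a damped
    Newton iteration; steps are halved whenever they would leave the
    interior of the feasible region (explicit feasibility check).
    """
    x, path = x0, []
    for eps in eps_values:
        for _ in range(100):
            dphi = 1.0 - eps / (x - 2.0)      # phi'(x)
            if abs(dphi) < 1e-10:
                break
            step = dphi * (x - 2.0)**2 / eps  # Newton step, phi'' = eps/(x-2)^2
            while x - step <= 2.0:            # stay strictly feasible
                step *= 0.5
            x -= step
        path.append(x)
    return np.array(path)

eps_values = [1.0, 0.1, 0.01, 0.001]
path = barrier_path(eps_values)
# Each barrier minimizer is x = 2 + eps: the iterates approach x* = 2 from
# inside the feasible region, and the multiplier estimate eps / (x - 2)
# equals the exact Lagrange multiplier (here 1), the constraint being linear.
```

Contrast this interior path with the exterior path generated by the penalty sketch: the two schemes bracket the solution from opposite sides of the constraint boundary.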
In recent years, there has been much excitement because some variants of the interior-point algorithm can be shown to run in polynomial time for many classes of convex programs. Moreover, interior-point codes are now proving to be highly competitive with codes based on other algorithms, such as the SQP algorithms presented subsequently. A number of free and commercial interior-point solvers are given in Tab. 1.1 below.
Sequential quadratic programming (SQP) methods, also known as successive, or recursive, quadratic programming, employ Newton’s method (or quasi-Newton methods) to directly solve the KKT conditions for the original problem. As a result, the accompanying subproblem turns out to be the minimization of a quadratic approximation to the Lagrangian function subject to a linear approximation to the constraints. Hence, this type of process is also known as a projected Lagrangian, or the Newton-Lagrange, approach. By its nature, this method produces both primal and dual (Lagrange multiplier) solutions.
To present the concept of SQP, consider first the equality constrained nonlinear program:
where , and , are twice continuously differentiable. We shall also assume throughout that the equality constraints are linearly independent at a solution . (The extension for including inequality constraints is considered subsequently.)
By Theorem 2.13, the first-order necessary conditions of optimality for an equality constrained NLP require a primal solution and a Lagrange multiplier vector such that:
where . Now, consider a Newton-like method to solve the latter system of nonlinear equations. Given an iterate , a new iterate is obtained by solving the first-order approximation:
Substituting , the first row of the system of equations is:
The terms involving the Lagrange multiplier can be simplified, giving:
The second row of the system of equations is independent of the Lagrange multipliers; it reads:
Therefore the system can be simplified as:
where we have defined the Newton step as . The simplified system can be solved for if a solution exists. Setting , and incrementing by , we can then repeat the process until happens to solve the linear system. When this occurs, if at all, we shall have found a stationary point of the equality constrained NLP.
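The Newton-Lagrange iteration just described can be sketched as follows. The example problem, starting point, and symbol names are our own: W denotes the Hessian of the Lagrangian and A the Jacobian of the equality constraints, and each iteration solves the linear KKT system for the step and the new multipliers:

```python
import numpy as np

def newton_lagrange(grad_f, hess_L, h, jac_h, x0, lam0, tol=1e-10, max_iter=50):
    """Newton-Lagrange iteration for  min f(x)  s.t.  h(x) = 0  (a sketch).

    Each step solves the linear KKT system
        [ W  A^T ] [ d       ]   [ -grad_f(x) ]
        [ A   0  ] [ lam_new ] = [ -h(x)      ]
    with W the Hessian of the Lagrangian and A the Jacobian of h.
    """
    x, lam = np.asarray(x0, float), np.asarray(lam0, float)
    n, m = len(x), len(lam)
    for _ in range(max_iter):
        g, A, hx = grad_f(x), jac_h(x), h(x)
        if np.linalg.norm(np.concatenate([g + A.T @ lam, hx])) < tol:
            break
        W = hess_L(x, lam)
        K = np.block([[W, A.T], [A, np.zeros((m, m))]])
        sol = np.linalg.solve(K, -np.concatenate([g, hx]))
        x, lam = x + sol[:n], sol[n:]
    return x, lam

# Illustrative problem:  min x1 + x2  s.t.  x1^2 + x2^2 = 1,
# with solution x* = -(1, 1)/sqrt(2) and multiplier 1/sqrt(2)
grad_f = lambda x: np.array([1.0, 1.0])
h = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0])
jac_h = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]]])
hess_L = lambda x, lam: lam[0] * 2.0 * np.eye(2)  # f is linear here
x_star, lam_star = newton_lagrange(grad_f, hess_L, h, jac_h, [-1.0, -0.5], [1.0])
```

Started near the solution, the primal-dual iterates converge quadratically, consistent with the convergence discussion that follows for the QP-subproblem formulation.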
Interestingly enough, a quadratic programming (QP) minimization subproblem can be employed in lieu of the foregoing linear system to find any optimal solution for the linearized NLP problem. Given we define the subproblem as:
First, note that an optimum to , if it exists, together with the set of Lagrange multipliers associated with the linearized constraints satisfies the linearized KKT conditions. Second, the objective function of represents not just a quadratic approximation for but also incorporates an additional term to represent the curvature of the constraints.
Algorithm 2.5: SQP
Initialization Step:
Choose an initial primal/dual point , let , and go to the main step.
Main Step:
In the case where is a regular stationary solution for the NLP problem which, together with a set of Lagrange multipliers , satisfies the second-order sufficiency conditions of Theorem 2.16, the matrix
can be shown to be nonsingular. Hence, the above rudimentary SQP algorithm exhibits a quadratic rate of convergence by Theorem 2.21.
We now consider the inclusion of inequality constraints in the NLP problem, that is:
where the functions are twice continuously differentiable.
Given an iterate , where and are the Lagrange multiplier estimates for the equality and inequality constraints, respectively, consider the following QP subproblem as a direct extension of :
where . Note that the KKT conditions for require that, in addition to primal feasibility, Lagrange multipliers , be found such that:
with and unrestricted in sign. Clearly, if , then together with yields a KKT solution to the original NLP problem. Otherwise, we set as before, increment by , and repeat the process. Regarding the convergence rate, it can be shown that if is a regular KKT solution which, together with , satisfies the second-order sufficient conditions of Theorem 2.18, and if is initialized sufficiently close to , then the foregoing iterative procedure exhibits a quadratic convergence rate.
The SQP method, as presented thus far, obviously shares the disadvantages of Newton’s method: (i) it requires second-order derivatives to be calculated, which in addition might not be positive definite, and (ii) it lacks the global convergence property.
It can be shown that this modification to the rudimentary SQP algorithm, similar to the quasi-Newton modification of Newton’s algorithm, loses the quadratic convergence rate property. Instead, it can be shown that the convergence is superlinear when initialized sufficiently close to a solution that satisfies both regularity and second-order sufficiency conditions. However, this superlinear convergence rate relies strongly on the use of unit step sizes (see point (ii) below).
which satisfies the important property that is a local minimizer of , provided satisfies the second-order sufficient conditions (see Theorem 2.18) and the penalty parameter is so chosen that , and , . Yet, the merit function is not differentiable at those with either or , and it can be unbounded below even though is a local minimizer.
with , has similar properties to the merit function, provided is chosen large enough, and is continuously differentiable (although its Hessian matrix is discontinuous). Yet, for to be a (local) minimizer of , it is necessary that and .
An SQP algorithm including the modifications discussed in (i) and (ii) is as follows:
Algorithm 2.6: Improved SQP
Initialization Step:
Choose an initial primal/dual point , with , and a positive definite matrix . Let , and go to the main step.
Main Step:
What Can Go Wrong?
The material presented up to this point was intended to give the reader an understanding of how SQP methods should work. Things do not go so smoothly in practice though. We now discuss a number of common difficulties that can be encountered, and suggest remedial actions to correct the deficiencies. Because real applications may involve more than a single difficulty, the user must be prepared to correct all problems before obtaining satisfactory performance from an optimization software.
One of the most common difficulties occurs when the NLP problem has infeasible constraints, i.e., the constraints taken all together have no solution. Applying general-purpose SQP software to such problems typically produces one or more of the following symptoms:
Although robust SQP software attempts to diagnose this situation, ultimately the only remedy is to reformulate the NLP.
In contrast to the previous situation, it is possible for the constraints to be consistent, but for the Jacobian matrix of the active constraints at the solution point to be either ill-conditioned or rank deficient. This situation was illustrated in Examples 1.43 and 1.55. The application of general-purpose SQP software is likely to produce the following symptoms:
Note that many symptoms of rank-deficient constraints are the same as those of inconsistent constraints. It is therefore quite common to confuse this deficiency with inconsistent constraints. Again, the remedy is to reformulate the problem.
A third type of difficulty occurs when the NLP problem contains redundant constraints. Two types of redundancy may be distinguished. In the first type, some of the constraints are unnecessary to the problem formulation, which typically results in the following symptoms:
In the second type, the redundant constraints give rise to rank deficiency, and the problem then exhibits symptoms similar to the rank-deficient case discussed previously. Obviously, the remedy is to reformulate the problem by eliminating the redundant constraints.
Perhaps the biggest obstacle encountered in the practical application of SQP methods (as well as many other NLP methods, including SUM techniques) is the presence of discontinuous behavior. All of the numerical methods described herein assume continuous and differentiable objective functions and constraints. Yet, there are many common examples of discontinuous functions in practice, including: IF tests in codes; absolute value, min, and max functions; linear interpolation of data; internal iterations such as root finding; etc.
Regarding SQP methods, the standard QP subproblems are no longer appropriate when discontinuities are present. In fact, the KKT necessary conditions simply do not apply! The most common symptoms of discontinuous functions are:
The remedy consists in reformulating discontinuous problems into smooth problems: for absolute value, min, and max functions, tricks can be used that introduce slack variables and additional constraints; linear data interpolation can be replaced by higher order interpolation schemes that are continuous through second derivatives; internal iterations can also be handled via additional NLP constraints; etc.
Any SQP code requires that the user supply the objective function and constraint values, as well as their gradients (and possibly their Hessians too). In general, the user is offered the option of calculating the gradients via finite differences, e.g.,
However, this may cause the algorithm to stop prematurely. First of all, the choice of the perturbation vector is highly nontrivial. While too large a value clearly provides inaccurate estimates, too small a value may also result in very poor estimates due to finite-precision arithmetic. Therefore, one must try to find a trade-off between these two extreme situations. The difficulty stems from the fact that a trade-off may not necessarily exist if the requested accuracy for the gradient is too high. In other words, the error made in the finite-difference approximation of a gradient cannot be made as small as desired. Further, the maximum accuracy that can be achieved with finite differences is both problem-dependent (e.g., badly scaled functions are more problematic than well-scaled functions) and machine-dependent (e.g., double-precision computations provide more accurate estimates than single-precision computations). Typical symptoms of inaccurate gradient estimates in an SQP code are:
The situation can be understood as follows. Assume that the gradient estimate is contaminated with noise. Then, instead of computing the true value , we get . But since the iteration seeks a point such that , we can expect either a degraded rate of convergence or, worse, no convergence at all, because ultimately the gradient will be dominated by noise.
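The accuracy limitation just described is easy to demonstrate numerically; the function and step sizes below are illustrative:

```python
import numpy as np

# Forward-difference approximation of f'(x) for f = exp at x = 1, whose
# true value is e.  Too large a perturbation gives truncation error; too
# small a perturbation is swallowed by round-off, so the overall error
# cannot be made arbitrarily small.
x, true_deriv = 1.0, np.exp(1.0)
errors = {}
for h in [1e-1, 1e-4, 1e-8, 1e-16]:
    approx = (np.exp(x + h) - np.exp(x)) / h
    errors[h] = abs(approx - true_deriv)
# With h = 1e-16 the perturbation is entirely lost to rounding
# (1.0 + 1e-16 == 1.0 in double precision), the estimate is zero,
# and the "gradient" is pure noise.
```

The error first shrinks as the step decreases, then grows again once cancellation dominates, which is precisely the trade-off discussed above.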
To avoid these problems, the user should always consider providing the gradients explicitly to the SQP solver, instead of relying on finite-difference estimates. For large-scale problems, this is obviously a time-consuming and error-prone task. In response to this, efficient algorithmic differentiation tools (also called automatic differentiation) have been developed within the last fifteen years. The idea is that, given a piece of program calculating a number of function values (e.g., in Fortran 77 or C), an auxiliary program is generated that calculates the derivatives of these functions.
Scaling affects everything! Poor scaling can make a good algorithm behave badly. Scaling changes the convergence rate, termination tests, and numerical conditioning.
The most common way of scaling a problem is by introducing scaled variables of the form
for with and being scale weights and shifts, respectively. Likewise, the objective function and constraints are commonly scaled using
for . The idea is to let the optimization algorithm work with well-scaled quantities in order to improve performance. However, what constitutes well-scaled quantities is hard to define, although conventional wisdom suggests the following hints:
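The effect of variable scaling can be illustrated on a badly scaled quadratic; the example and the particular weights below are our own, chosen, as suggested above, from the typical magnitude of each variable:

```python
import numpy as np

# Consider f(x) = (x1 - 1)^2 + 1e6 * (x2 - 1e-3)^2, whose Hessian is badly
# conditioned.  Scaling the variables as x_tilde_i = (x_i - r_i) / w_i, with
# weights w_i matching the typical magnitude of x_i, transforms the Hessian
# into  diag(w) H diag(w), which here becomes perfectly conditioned.
H = np.diag([2.0, 2.0e6])                 # Hessian in the original variables
w = np.array([1.0, 1.0e-3])               # scale weights
H_scaled = np.diag(w) @ H @ np.diag(w)    # Hessian in the scaled variables
cond_before = np.linalg.cond(H)
cond_after = np.linalg.cond(H_scaled)
```

Since convergence rates, termination tests, and linear-algebra accuracy all depend on this conditioning, a simple diagonal rescaling of this kind can dramatically change an algorithm's observed behavior.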